Extract all tables and h4 elements from HTML

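I have an HTML file and I want to extract all the tables and h4 elements from it. That is, I want to take only the tables and the h4 elements out of the file and use them somewhere else. I use Notepad++ and am looking for some Python script to do this.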

<html> 
// header 
<body> 
    <div> 
    <h4></h4> 
    <h4></h4> 
    <table> 
    // some rows with cells here 
    </table> 
    // maybe some content here 
    <table> 
    // a form and other stuff 
    </table> 
    // probably some more text 
</div> 
</body> 
</html>

Thanks

Source

2014-02-07 swapna

What have you tried so far? – svenwltr

I suggest using the BeautifulSoup module.

You can do what you want like this:

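from bs4 import BeautifulSoup

# Open the HTML file and parse it with Python's built-in parser
with open("file.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# find_all returns every matching tag in document order
htags = soup.find_all("h4")
tabletags = soup.find_all("table")

# Print the text content of each heading, then of each table
for h in htags:
    print(h.text)
for table in tabletags:
    print(table.text)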
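
The snippet above prints only the text inside each tag. Since the question asks to reuse the tables and h4 elements elsewhere, it may be more useful to keep the markup itself. Below is a minimal sketch of that idea, assuming bs4 is installed; the helper name `extract_h4_and_tables` and the `sample` string are made up for illustration and are not part of the original question or answer.

```python
from bs4 import BeautifulSoup


def extract_h4_and_tables(html):
    """Return the markup of every <h4> and <table>, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list to find_all matches any of the listed tag names.
    return [str(tag) for tag in soup.find_all(["h4", "table"])]


# Hypothetical sample input, loosely modelled on the layout in the question.
sample = """
<html><body><div>
<h4>First</h4>
<h4>Second</h4>
<table><tr><td>1</td></tr></table>
<p>some text</p>
<table><tr><td>2</td></tr></table>
</div></body></html>
"""

fragments = extract_h4_and_tables(sample)
print("\n".join(fragments))
```

To use the result elsewhere, the joined fragments could be written to a new file with an ordinary `open(..., "w")` call. One caveat: `find_all` also matches nested tags, so a table nested inside another table would show up twice, once inside its parent's markup and once on its own.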

Source

2014-02-07 14:51:13 RydallCooper