Filter all inner text out of an HTML document

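I have a large HTML document and I want to remove all the inner text between the tags. Everything I can find is about extracting the text from HTML. What I want is the opposite: the original HTML tags with their attributes intact. How can I filter out the text?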

2014-03-31 ATMA

Find every text element with soup.find_all(text=True), then call .extract() on each one to remove it from the document:

for textelement in soup.find_all(text=True): 
    textelement.extract()

Demo:

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup('''\ 
... <html><body><p>Hello world!<p> 
... <div><ul><li>This is all 
... </li><li>Set to go!</li></ul></div> 
... </body></html>''') 
>>> soup 
<html><body><p>Hello world!</p><p> 
</p><div><ul><li>This is all 
</li><li>Set to go!</li></ul></div> 
</body></html> 
>>> for textelement in soup.find_all(text=True): 
...  textelement.extract() 
... 
u'Hello world!' 
u'\n' 
u'This is all\n' 
u'Set to go!' 
u'\n' 
>>> print soup.prettify() 
<html>
 <body>
  <p>
  </p>
  <p>
  </p>
  <div>
   <ul>
    <li>
    </li>
    <li>
    </li>
   </ul>
  </div>
 </body>
</html>
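
The session above is Python 2. For Python 3, a minimal sketch of the same approach might look like the following. The function name strip_inner_text is made up for illustration, and the choice of the built-in "html.parser" backend is an assumption, not something the answer specifies. Recent bs4 releases spell the argument string=True; text=True is the older name for it. Comments and the doctype are also string nodes in bs4, so this sketch skips them to keep them in the output; drop that check if they should go too.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype


def strip_inner_text(html: str) -> str:
    """Return the markup with text nodes removed; tags and attributes stay.

    Hypothetical helper for illustration, not part of bs4.
    """
    # Assumption: the stdlib-backed parser is good enough for the input.
    soup = BeautifulSoup(html, "html.parser")

    # find_all() returns a list built up front, so removing nodes
    # while looping over it is safe.
    for node in soup.find_all(string=True):
        # Comment and Doctype are NavigableString subclasses and match
        # string=True as well; leave them in place.
        if isinstance(node, (Comment, Doctype)):
            continue
        node.extract()

    return str(soup)


if __name__ == "__main__":
    print(strip_inner_text('<p class="a">Hello <b id="x">world</b>!</p>'))
```

Note that the text inside script and style elements is stored as string nodes too, so it is removed along with everything else.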


2014-03-31 20:41:14
