2013-08-16

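How do I ignore tags when getting the .string of a Beautiful Soup element?

I am working with HTML elements that have child tags, which I want to "ignore" or remove so that the text is still there. Right now, if I try to .string any element that contains a tag, all I get is None.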

import bs4

soup = bs4.BeautifulSoup("""
    <div id="main">
     <p>This is a paragraph.</p>
     <p>This is a paragraph <span class="test">with a tag</span>.</p>
     <p>This is another paragraph.</p>
    </div>
""", "html.parser")

main = soup.find(id='main')
for child in main.children:
    print(child.string)

Output:

This is a paragraph. 
None 
This is another paragraph. 

I want the second line to be "This is a paragraph with a tag." How can I do that?

Answers

for child in soup.find(id='main'):
    # Skip the whitespace-only strings between the <p> tags.
    if isinstance(child, bs4.Tag):
        print(child.text)

And you will get:

This is a paragraph. 
This is a paragraph with a tag. 
This is another paragraph. 
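
A variation on the same idea, as a minimal sketch: instead of filtering the children by type, it asks for the <p> tags directly and calls get_text() on each. The names HTML and paragraph_texts are made up for this illustration, and the "html.parser" backend is an assumption; any installed parser should behave the same for this markup.

```python
import bs4

# Hypothetical sample markup, mirroring the question's example.
HTML = """
<div id="main">
  <p>This is a paragraph.</p>
  <p>This is a paragraph <span class="test">with a tag</span>.</p>
  <p>This is another paragraph.</p>
</div>
"""


def paragraph_texts(html):
    """Return the text of every <p> inside #main, with child tags flattened."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    main = soup.find(id="main")
    # get_text() concatenates all descendant strings, so a nested <span>
    # does not turn the result into None the way .string does.
    return [p.get_text() for p in main.find_all("p")]


texts = paragraph_texts(HTML)
for line in texts:
    print(line)
```

Using find_all("p") also sidesteps the whitespace strings between the tags, so no isinstance check is needed.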

Use the .strings iterable instead. Use ''.join() to pull all the strings together into one:

print(''.join(main.strings))

Iterating over .strings yields every string the element contains, whether directly or inside a child tag.

Demo:

>>> print(''.join(main.strings))

This is a paragraph. 
This is a paragraph with a tag. 
This is another paragraph.
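
To make the difference concrete, here is a small sketch that looks at just the paragraph containing the <span>. It assumes the "html.parser" backend, and the variable names are only for illustration. It shows .string coming back as None, the individual pieces that .strings yields, and .stripped_strings, which trims surrounding whitespace from each piece.

```python
import bs4

soup = bs4.BeautifulSoup(
    '<p>This is a paragraph <span class="test">with a tag</span>.</p>',
    "html.parser",
)
p = soup.p

# .string is None here because <p> has more than one child node.
string_value = p.string

# .strings yields every descendant string in document order.
pieces = list(p.strings)

# Joining the pieces reproduces the full text of the paragraph.
joined = "".join(pieces)

# .stripped_strings strips each piece and skips whitespace-only ones.
stripped = list(p.stripped_strings)

print(string_value)
print(pieces)
print(joined)
print(stripped)
```

Note that joining the stripped pieces with a space would put a space before the final period, so plain .strings is the better fit when the original spacing matters.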