2014-02-13
0

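How to remove the contents of a nested tag with BeautifulSoup?

These posts show the opposite, retrieving the contents of nested tags: How to get contents of nested tag using BeautifulSoup, and BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

How can I use BeautifulSoup to remove the contents of nested tags?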

I tried .text, but it only strips the tags and keeps the nested text:

>>> from bs4 import BeautifulSoup as bs 
>>> html = "<foo>Something something <bar> blah blah</bar> something</foo>" 
>>> bs(html).find_all('foo')[0] 
<foo>Something something <bar> blah blah</bar> something else</foo> 
>>> bs(html).find_all('foo')[0].text 
u'Something something blah blah something else' 

Desired output:

Something something something else

+0

So... in this example, you want to remove the contents of 'bar'?

+0

Should there be an "else" in the second line of code?

Answers

2

You can iterate over the element's children and keep only the bs4.element.NavigableString instances:

from bs4 import BeautifulSoup as bs
import bs4

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # Yield only the text nodes that are direct children of elem, skipping nested tags
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item

print(''.join(get_only_text(bs(html, 'html.parser').find_all('foo')[0])))

Output:

Something something something else 
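
The same idea can be sketched with find_all instead of an explicit loop. This is only an illustration, assuming a bs4 version that accepts the string argument (4.4.0 or later); the helper name direct_text is made up for this example. It also collapses the double spaces left behind where the nested tags were.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def direct_text(tag):
    """Return the text that sits directly inside `tag`, skipping nested tags.

    Hypothetical helper, not part of bs4.
    """
    # string=True matches text nodes only; recursive=False restricts the
    # search to direct children, so text inside <bar> and <bar2> is ignored.
    pieces = tag.find_all(string=True, recursive=False)
    # split() + join collapses the runs of whitespace left where tags were.
    return " ".join("".join(pieces).split())

foo = BeautifulSoup(html, "html.parser").find("foo")
print(direct_text(foo))
```

Passing "html.parser" explicitly keeps the result independent of which optional parsers happen to be installed.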
0

For example, you can replace each nested tag with an empty string:

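body = bs(html, 'html.parser')  # html as defined in the question
for tag in body.find_all('bar'):
    tag.replace_with('')  # replace each <bar> tag, contents included, with an empty string
print(body.find('foo').text)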
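
Note that this approach modifies the parsed tree in place. If the original tree is still needed afterwards, one possible variant is to work on a copy and delete the nested tags with decompose(). The sketch below assumes bs4 4.4.0 or later (where copy.copy works on a Tag); the helper name text_without is hypothetical.

```python
import copy
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something else</foo>"

def text_without(tag, names):
    """Return the text of `tag` without the named nested tags.

    Hypothetical helper, not part of bs4; `tag` itself is left untouched.
    """
    clone = copy.copy(tag)  # copy of the tag and its children, detached from the tree
    for nested in clone.find_all(names):
        nested.decompose()  # delete the nested tag together with its contents
    # Collapse the whitespace left where the removed tags used to be.
    return " ".join(clone.get_text().split())

foo = BeautifulSoup(html, "html.parser").find("foo")
result = text_without(foo, ["bar"])
print(result)
```

After the call, foo still contains its <bar> child, so later lookups on the original tree are unaffected.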