Finding tags and text in BeautifulSoup

I'm having trouble formulating a findAll query for BeautifulSoup that does what I want. Previously, I used findAll to extract only the text from some HTML, essentially stripping out all the tags. For example, if I had:

<b>Cows</b> are being abducted by aliens according to the 
<a href="www.washingtonpost.com>Washington Post</a>.

This would be reduced to:

Cows are being abducted by aliens according to the Washington Post.

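I did this with ''.join(html.findAll(text=True)). This worked well until I decided I wanted to keep only the <a> tags while stripping all the others. So, given the original example, I would end up with this: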

Cows are being abducted by aliens according to the 
<a href="www.washingtonpost.com>Washington Post</a>.

I initially thought the following would do the trick:

''.join(html.findAll({'a':True}, text=True))

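However, this doesn't work, because text=True seems to mean that only text will be found. What I need is some kind of OR option: I want to find either text or <a> tags. It is important that the tags stay around the text they mark up; I can't have tags or text end up out of order.

Any ideas?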

Source

2011-08-07 cryptic_star

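Note: BeautifulSoup.findAll is a search API. Its first named argument, name, can be used to restrict the search to a given set of tags. A single findAll call cannot select all the text between tags and, at the same time, select both the text and the tag for <a>. So I came up with the following solution.

This solution depends on importing BeautifulSoup.Tag.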

from BeautifulSoup import BeautifulSoup, Tag 

soup = BeautifulSoup('<b>Cows</b> are being abducted by aliens according to the <a href="www.washingtonpost.com>Washington Post</a>.') 
parsed_soup = ''

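We walk the parse tree as a list, using the contents attribute. When an item is a Tag and is not an <a> tag, we extract only its text. Otherwise, we take the whole string, markup included. This uses the "navigating the parse tree" API.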

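for item in soup.contents:
    if type(item) is Tag and u'a' != item.name:
        # Any tag other than <a>: keep only the text inside it.
        parsed_soup += ''.join(item.findAll(text=True))
    else:
        # Plain strings and <a> tags: keep as-is, markup included.
        parsed_soup += unicode(item)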

The order of the text is preserved.

>>> parsed_soup
u'Cows are being abducted by aliens according to the <a href=\'"www.washingtonpost.com\'>Washington Post</a>.'
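
The loop above only looks at the top-level items in soup.contents, so an <a> nested inside another tag would be stripped along with its parent. As a rough sketch of the same idea (keep text in document order, re-emit only selected tags) at any nesting depth, here is a version that swaps BeautifulSoup for Python 3's standard-library html.parser. The names KeepTagsStripper and strip_tags_except are hypothetical, and the sample input assumes a well-formed href with its closing quote.

```python
from html import escape
from html.parser import HTMLParser


class KeepTagsStripper(HTMLParser):
    """Collect text in document order, re-emitting only the tags in `keep`."""

    def __init__(self, keep=("a",)):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # Attribute values arrive unescaped, so escape them again.
            # A valueless attribute is written out as name="".
            rendered = "".join(
                ' %s="%s"' % (name, escape(value or "", quote=True))
                for name, value in attrs
            )
            self.parts.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Re-escape &, < and > so the result is still valid markup.
        self.parts.append(escape(data, quote=False))


def strip_tags_except(html, keep=("a",)):
    """Return `html` with every tag removed except those named in `keep`."""
    parser = KeepTagsStripper(keep)
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.parts)
```

Calling strip_tags_except('<p><b>Cows</b> are being abducted by aliens according to the <a href="www.washingtonpost.com">Washington <i>Post</i></a>.</p>') returns the sentence with only the <a> element left in place, including when the link sits inside the <p>. Because the parser reports start tags, text, and end tags in the order they occur, the kept tags stay wrapped around the text they mark up.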

Source

2011-08-07 21:47:10

Thank you very much! Apart from using the 'findAll' method, I'm not very familiar with navigating the parse tree.
