Using BeautifulSoup to extract text between line breaks (e.g. <br /> tags)

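I have the following HTML, which is part of a larger document:

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

I'm currently using BeautifulSoup to obtain other elements within the HTML, but I have not been able to find a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each of the <br /> elements, but I can't find a way to get the text in between them. Any help would be greatly appreciated. Thanks.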

2011-03-11 maltman

If you just want any text that sits between two <br /> tags, you could do something like the following:

from BeautifulSoup import BeautifulSoup, NavigableString, Tag 

input = '''<br /> 
Important Text 1 
<br /> 
<br /> 
Not Important Text 
<br /> 
Important Text 2 
<br /> 
Important Text 3 
<br /> 
<br /> 
Non Important Text 
<br /> 
Important Text 4 
<br />''' 

soup = BeautifulSoup(input) 

for br in soup.findAll('br'):
    # The node right after this <br /> must be a text node
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s, NavigableString)):
        continue
    # ...and the node right after that text must be another <br />
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s

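But perhaps I misunderstand your question? The description of your problem doesn't seem to match up with the "important" / "non important" labels in your example data, so I've gone with the description ;)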
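The code in this answer targets BeautifulSoup 3 on Python 2 (`from BeautifulSoup import ...`, the `print` statement). For anyone on the `bs4` package and Python 3, here is a minimal sketch of the same sibling check. The function name `text_between_br`, the shortened sample markup, and the choice of the `html.parser` backend are my own assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened version of the sample markup from the question (assumption: top-level <br /> tags)
html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />"""


def text_between_br(markup):
    """Return the stripped, non-empty strings that sit directly between two <br> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for br in soup.find_all("br"):
        # The node right after this <br> must be a text node
        node = br.next_sibling
        if not isinstance(node, NavigableString):
            continue
        # ...and the node right after that text must be another <br>
        after = node.next_sibling
        if isinstance(after, Tag) and after.name == "br":
            text = node.strip()
            if text:
                found.append(text)
    return found


print(text_between_br(html))
```

Like the original, this returns every line that sits between two <br /> tags, "Not Important Text" included; telling important lines from unimportant ones would need an extra rule that the question does not spell out.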

2011-03-11 17:00:28

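Ah, the problem was that I was using findNextSibling(), which just skipped over the text and went to the next line break. Using nextSibling works. Thanks for the help! – maltman 2011-03-14 15:22:29

Great answer, this was giving me a real headache! – Nick 2013-07-24 01:58:41

Isn't 'next' a reserved word in Python? Maybe a different variable name would be better? (It's a small point, but this stuff adds up!) – duhaime 2013-10-18 02:20:50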
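To illustrate the difference maltman ran into in the first comment, here is a small sketch using the `bs4` spellings `find_next_sibling()` and `.next_sibling` (assumed here to behave like BeautifulSoup 3's `findNextSibling()` and `nextSibling`). The one-line markup is made up for the demonstration.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<br />\nImportant Text 1\n<br />", "html.parser")
first_br = soup.find("br")

# .next_sibling returns whatever node comes next, here the text node
text_node = first_br.next_sibling

# find_next_sibling("br") searches for a matching tag, so it steps over the text
next_tag = first_br.find_next_sibling("br")

print(type(text_node).__name__, repr(text_node.strip()))
print(next_tag.name)
```

So the method call is the right tool for jumping from tag to tag, while the attribute is what you want when the text node itself is the target.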

So, for testing purposes, let's assume that this chunk of HTML is inside a span tag:

x = """<span><br /> 
Important Text 1 
<br /> 
<br /> 
Not Important Text 
<br /> 
Important Text 2 
<br /> 
Important Text 3 
<br /> 
<br /> 
Non Important Text 
<br /> 
Important Text 4 
<br /></span>"""

Now I'm going to parse it and find my span tag:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)  # parse the string defined above
y = soup.find('span')

If you iterate over the generator returned by y.childGenerator(), you will get both the br tags and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a) 
    ....: 
<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 1 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Not Important Text 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 2 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 3 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Non Important Text 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 4 

<type 'instance'> <br />
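The same walk over the children can be sketched with `bs4` on Python 3. This assumes `.children` is the `bs4` counterpart of `childGenerator()` and uses the `html.parser` backend; the sample is a shortened copy of the span above, and filtering out the whitespace-only strings is my addition rather than something this answer does.

```python
from bs4 import BeautifulSoup, NavigableString

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br /></span>"""

soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")

# .children yields tags and text nodes in document order;
# keep only the text nodes that are not just whitespace
lines = [
    child.strip()
    for child in span.children
    if isinstance(child, NavigableString) and child.strip()
]

print(lines)
```

This collects every text line inside the span, so it is only a starting point if some of the lines have to be excluded.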

2011-03-11 17:01:44

The following worked for me:

for br in soup.findAll('br'):
    if str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
        print br.contents[0]

2016-02-02 16:59:20 Pontios

Please don't rely on the string representation of objects for code logic. – Sylvain 2017-05-05 10:13:07
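A short sketch of why Sylvain's point matters, written for `bs4` on Python 3 with the `html.parser` backend (both assumptions on my part). Under `bs4` the class lives in a different module, so a comparison against the BeautifulSoup 3 class path quietly fails, while `isinstance` keeps working. The sketch checks the node after the <br /> rather than `br.contents[0]`, because with this parser a <br /> tag has no contents.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<br />Important Text 1<br />", "html.parser")
node = soup.find("br").next_sibling

# Fragile: depends on the exact module path printed for the class
by_repr = str(type(node)) == "<class 'BeautifulSoup.NavigableString'>"

# Robust: asks the question directly
by_type = isinstance(node, NavigableString)

print(by_repr, by_type)
```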
