How can I get all the text between two specified tags using BeautifulSoup?

html = """ 
... 
<tt class="descname">all</tt> 
<big>(</big> 
<em>iterable</em> 
<big>)</big> 
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a> 
... 
"""

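I want all the text from the opening <big> tag up to the first occurrence of an <a> tag. For the example above, that means I should get (iterable) as a single string. How can I get all the text between two specified tags using BeautifulSoup?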

2012-08-04 Amit Yadav

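I would avoid nextSibling because, from your question, you want to include everything up to the next <a>, regardless of whether it sits in a sibling, parent, or child element.

So I think the best approach is to find the node that is the next <a> element and then loop recursively until you reach it, adding each string you encounter along the way. You may need to tidy up the code below if your HTML differs a lot from the sample, but something like this should work: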

from bs4 import BeautifulSoup, NavigableString
# Uses the `html` variable from the question.
soup = BeautifulSoup(html, 'html.parser')
first_big_tag = soup.find('big')
next_a_tag = first_big_tag.find_next('a')
def loop_until_a(text, element):
    # Stop once the walk reaches the <a> tag found above
    if element is next_a_tag:
        return text
    # Collect text nodes only, so the whitespace between tags is stripped away
    if isinstance(element, NavigableString):
        text += element.strip()
    return loop_until_a(text, element.next_element)
target_string = loop_until_a('', first_big_tag)
print(target_string)
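
The recursive walk above adds one stack frame per node, so a long run of markup between the two tags could hit Python's recursion limit. As a minimal sketch of the same idea written as a loop: the helper name text_between and the one-line sample string are assumptions made for illustration, while find_next and next_elements are regular bs4 navigation calls.

```python
from bs4 import BeautifulSoup, NavigableString

def text_between(soup, start_name, stop_name):
    """Return the text from the first `start_name` tag up to the next `stop_name` tag."""
    start = soup.find(start_name)
    if start is None:
        return ''
    stop = start.find_next(stop_name)
    parts = []
    # next_elements walks the document in order: children first, then later siblings
    for node in start.next_elements:
        if node is stop:
            break
        if isinstance(node, NavigableString):
            parts.append(node.strip())
    return ''.join(parts)

# Hypothetical one-line version of the markup from the question
sample = '<tt>all</tt> <big>(</big> <em>iterable</em> <big>)</big> <a href="#all">¶</a>'
result = text_between(BeautifulSoup(sample, 'html.parser'), 'big', 'a')
print(result)  # (iterable)
```

If no stop tag follows the start tag, this sketch simply collects text to the end of the document; whether that is the right fallback depends on your pages.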

2012-08-04 13:59:44 anotherdave

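Yes, exactly: I want to include everything up to the next <a> tag, and there may be any number of tags and any amount of text between the first <big> tag and the first <a> tag. – 2012-08-04 14:37:38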

>>> from bs4 import BeautifulSoup as bs
>>> parsed = bs(html, 'html.parser')
>>> txt = []
>>> for i in parsed.find_all('big'):
...     txt.append(i.text)
...     # find_next_sibling() skips the whitespace text nodes between tags
...     sibling = i.find_next_sibling()
...     if sibling is not None and sibling.name != 'a':
...         txt.append(sibling.text)
...
>>> ''.join(txt)
'(iterable)'

2012-08-04 13:11:30

nextSibling can't be used, because I want to include the text of every tag up to the first occurrence of <a>. – 2012-08-04 14:39:17
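
To illustrate the limitation raised in this comment, here is a small sketch. The markup is hypothetical (it is not the question's HTML): the closing <big> and the <a> are nested inside a <span>, so they are no longer siblings of the first <big>. It assumes bs4 and compares next_siblings with next_elements.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the closing <big> and the <a> sit inside a <span>
nested = '<p><big>(</big><em>iterable</em><span><big>)</big><a href="#all">x</a></span></p>'
soup = BeautifulSoup(nested, 'html.parser')
first_big = soup.find('big')
stop = first_big.find_next('a')

# Sibling navigation stays on one level, so it never meets the nested <a>
sibling_names = [sib.name for sib in first_big.next_siblings]
reaches_a_by_siblings = any(sib is stop for sib in first_big.next_siblings)

# Document-order navigation descends into <span> and does reach it
strings = []
for node in first_big.next_elements:
    if node is stop:
        break
    if isinstance(node, NavigableString):
        strings.append(str(node))
joined = ''.join(strings)

print(sibling_names)          # ['em', 'span']
print(reaches_a_by_siblings)  # False
print(joined)                 # (iterable)
```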

You can do it like this:

from bs4 import BeautifulSoup
html = """ 
<tt class="descname">all</tt> 
<big>(</big> 
<em>iterable</em> 
<big>)</big> 
<a class="headerlink" href="test" title="Permalink to this definition"></a> 
""" 
soup = BeautifulSoup(html, 'html.parser')
# next_sibling is the whitespace after </big>; the element after it is <em>
print(soup.find('big').next_sibling.next_element.text)

For more detail on traversing the DOM with BeautifulSoup, see here

2012-08-04 13:47:39 mushfiq

This returns "iterable", not "(iterable)". – anotherdave 2012-08-04 14:02:18

An iterative approach.

from bs4 import BeautifulSoup as bs
from itertools import takewhile, chain

def get_text(html, from_tag, until_tag):
    soup = bs(html, 'html.parser')
    for big in soup(from_tag):
        until = big.find_next(until_tag)
        # Keep only the following siblings that carry non-blank text
        strings = (node for node in big.next_siblings if getattr(node, 'text', '').strip())
        selected = takewhile(lambda node: node is not until, strings)
        try:
            # next(selected) raises StopIteration when nothing lies between the two tags
            yield ''.join(getattr(node, 'text', '') for node in chain([big, next(selected)], selected))
        except StopIteration:
            pass

for text in get_text(html, 'big', 'a'):
    print(text)

2012-08-04 14:25:40
