Python BeautifulSoup: extracting text between elements

I'm trying to extract "THIS IS MY TEXT" from the following HTML:

<html> 
<body> 
<table> 
    <td class="MYCLASS"> 
     <!-- a comment --> 
     <a hef="xy">Text</a> 
     <p>something</p> 
     THIS IS MY TEXT 
     <p>something else</p> 
     </br> 
    </td> 
</table> 
</body> 
</html>

I tried it this way:

soup = BeautifulSoup(html) 

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
    print hit.text

But I get all the text from all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT"?

Source

2013-05-30 ɥɔǝnq ɹǝƃloɥ

Use .children instead:

from bs4 import NavigableString, Comment 
print ''.join(unicode(child) for child in hit.children 
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
...  print ''.join(unicode(child) for child in hit.children 
...   if isinstance(child, NavigableString) and not isinstance(child, Comment)) 
... 




     THIS IS MY TEXT
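
For readers on Python 3 with a current bs4, the same filter could be sketched roughly as below. The helper name direct_text and the choice of the "html.parser" backend are assumptions made for illustration, not part of the original answer; a different parser may build a slightly different tree.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# The HTML from the question, reproduced so the sketch is self-contained.
HTML = """<html>
<body>
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a hef="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     </br>
    </td>
</table>
</body>
</html>"""


def direct_text(tag):
    """Join the tag's own text nodes, skipping comments and child tags.

    Hypothetical helper name; it only looks at direct children.
    """
    parts = (
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )
    return "".join(parts).strip()


# "html.parser" is an assumption; it ships with the standard library.
soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))
```

Comment is a subclass of NavigableString, which is why the second isinstance check is needed to keep the comment out of the result.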

Source

2013-05-30 11:59:13

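This would return u'\na comment\nText\nsomething\nTHIS IS MY TEXT\nsomething else\n' or u'a commentTextsomethingTHIS IS MY TEXTsomething else', which is more text than needed. –

@CristianCiupitu: You're right, of course; I wasn't paying attention here. Updated. –

This is the only solution that doesn't depend on the order or position of the text relative to specific other text. Instead it extracts all text from the specified tag/element while ignoring the text (or other content) of child tags/elements. Thanks! It's awkward, but it works and solved my problem (I'm not the OP, but had a similar need). – geewiz

You can use .contents: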

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
...  print hit.contents[6].strip() 
... 
THIS IS MY TEXT

Source

2013-05-30 12:27:58 TerryA

Thanks, but the text isn't always in the same place. Will it work anyway? –

@ɥɔǝnqɹǝƃloɥ Alas, no. Perhaps use one of the other answers – TerryA

What does the number '6' mean? – User
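
On the question about the '6': it is a position in the element's list of children. A rough way to see this is to print each child of the cell together with its index, as in the sketch below. The variable names and the "html.parser" backend are assumptions; with another parser, or with different whitespace in the markup, the text may land at a different index, which is why a hard-coded index is fragile.

```python
from bs4 import BeautifulSoup

# The HTML from the question, reproduced so the sketch is self-contained.
HTML = """<html>
<body>
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a hef="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     </br>
    </td>
</table>
</body>
</html>"""

# "html.parser" is an assumption; other parsers may produce other indexes.
soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find("td", class_="MYCLASS")

# Pair every child with its position and type so index 6 can be read off.
children = [
    (index, type(child).__name__, str(child).strip())
    for index, child in enumerate(cell.contents)
]
for row in children:
    print(row)
```

Whitespace between tags counts as a child too, so the comment, the link, and the first paragraph each have a whitespace string next to them, which pushes the wanted text to position 6 in this tree.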

Learn more about how to navigate through the parse tree in BeautifulSoup. The parse tree contains Tags and NavigableStrings (the latter being pieces of text). An example:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>', 
     '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.', 
     '<p id="secondpara" align="blah">This is paragraph <b>two</b>.', 
     '</html>'] 
soup = BeautifulSoup(''.join(doc)) 

print soup.prettify() 
# <html> 
# <head> 
# <title> 
# Page title 
# </title> 
# </head> 
# <body> 
# <p id="firstpara" align="center"> 
# This is paragraph 
# <b> 
#  one 
# </b> 
# . 
# </p> 
# <p id="secondpara" align="blah"> 
# This is paragraph 
# <b> 
#  two 
# </b> 
# . 
# </p> 
# </body> 
# </html>

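To move down the parse tree you have contents and string.

contents is an ordered list of the Tag and NavigableString objects contained within a page element.
If a tag has only one child node, and that child node is a string, the child node is available as tag.string as well as tag.contents[0].

For the example above, that means you can get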

soup.b.string 
# u'one' 
soup.b.contents[0] 
# u'one'

With several child nodes, you can have, for example

pTag = soup.p 
pTag.contents 
# [u'This is paragraph ', <b>one</b>, u'.']

So here you can play with contents and get the content at the index you want.

You can also iterate over a tag, which is a shortcut. For example,

for i in soup.body: 
    print i 
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p> 
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
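
The snippets above use the old BeautifulSoup 3 import. A rough bs4 / Python 3 equivalent is sketched below; the variable names and the "html.parser" backend are assumptions, and the closing </p> and </body> tags were added to the sample so the result does not depend on how a parser repairs unclosed tags.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the paragraphs closed explicitly.
DOC = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(DOC, "html.parser")

single = soup.b.string        # <b> has exactly one child, a string
first = soup.b.contents[0]    # the same string, reached through contents
multi = soup.p.string         # <p> has several children, so this is None
children = soup.p.contents    # a string, the <b> tag, and another string

# Iterating over a tag walks its direct children.
top_level = [child.name for child in soup.body]

print(single, first, multi, children, top_level)
```

The None from soup.p.string is the same behaviour a later comment reports for hit.string: the table cell in the question has many children, so .string cannot pick one.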

Source

2013-05-30 12:46:44 octoback

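'hit.string' is 'None' and 'hit.contents[0]' is u'\n', so please provide an answer for the example in the question. –

So here you can play with contents and get the content at the index you want. – octoback

is the answer to the question – octoback

The BeautifulSoup documentation provides an example of removing objects from a document using the extract method. In the following example, the aim is to remove all comments from the document:

Removing elements

Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document: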

from BeautifulSoup import BeautifulSoup, Comment 
soup = BeautifulSoup("""1<!--The loneliest number--> 
        <a>2<!--Can be as bad as one--><b>3""") 
comments = soup.findAll(text=lambda text:isinstance(text, Comment)) 
[comment.extract() for comment in comments] 
print soup 
# 1 
# <a>2<b>3</b></a>
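
With bs4 on Python 3 the same idea might be sketched as follows. The "html.parser" backend is an assumption, and the sample markup closes its tags explicitly so the printed result does not depend on parser repairs. Note that removing comments alone does not isolate the text the question asks for; the text of the nested tags is still there.

```python
from bs4 import BeautifulSoup, Comment

# Sample from the documentation excerpt above, with tags closed explicitly.
MARKUP = "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>"

soup = BeautifulSoup(MARKUP, "html.parser")

# Collect every comment node first, then detach each one from the tree.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

print(soup)
```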

Source

2013-05-30 13:10:09

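Short answer: soup.findAll('p')[0].next.next

Real answer: you need an invariant reference point from which you can reach your target.

You mention in your comment on Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree along that invariant path.

For example, in the HTML provided in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds the paragraph elements, soup.findAll('p')[0] will be the first paragraph element.

You could use soup.find('p') in this case, but soup.findAll('p')[n] is more general, since your actual scenario may need the 5th paragraph or something like that.

The next attribute is the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next returns your target in the HTML you provided.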
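
In bs4 the corresponding attribute is spelled next_element. A minimal sketch of the two hops described above, assuming the "html.parser" backend and illustrative variable names:

```python
from bs4 import BeautifulSoup

# The HTML from the question, reproduced so the sketch is self-contained.
HTML = """<html>
<body>
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a hef="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     </br>
    </td>
</table>
</body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

first_p = soup.find_all("p")[0]   # the invariant reference point
inside = first_p.next_element     # first hop: the string inside <p>
target = inside.next_element      # second hop: the text node after </p>

print(repr(inside), repr(target.strip()))
```

This only works while the first paragraph is non-empty and contains plain text; if it held another tag, the hops would land somewhere else.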

Source

2013-05-31 03:46:28

Using your own soup object:

soup.p.next_sibling.strip()

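You grab the <p> directly with soup.p* (this relies on it being the first <p> in the parse tree).
Then use next_sibling on the tag object that soup.p returns, since the desired text sits at the same level of the parse tree as the <p>.
.strip() is just a Python str method that removes leading and trailing whitespace.

*Otherwise just find the element using the filter(s) of your choice.

In the interpreter, this looks something like: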

In [4]: soup.p 
Out[4]: <p>something</p> 

In [5]: type(soup.p) 
Out[5]: bs4.element.Tag 

In [6]: soup.p.next_sibling 
Out[6]: u'\n  THIS IS MY TEXT\n  ' 

In [7]: type(soup.p.next_sibling) 
Out[7]: bs4.element.NavigableString 

In [8]: soup.p.next_sibling.strip() 
Out[8]: u'THIS IS MY TEXT' 

In [9]: type(soup.p.next_sibling.strip()) 
Out[9]: unicode
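
If the page may contain other <p> tags before the cell, one option is to anchor the lookup on the cell first and then take the sibling. The sketch below assumes Python 3, bs4, and the "html.parser" backend; the variable names are illustrative, and on Python 3 the stripped result is a plain str rather than unicode.

```python
from bs4 import BeautifulSoup, NavigableString

# The HTML from the question, reproduced so the sketch is self-contained.
HTML = """<html>
<body>
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a hef="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     </br>
    </td>
</table>
</body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

cell = soup.find("td", class_="MYCLASS")  # narrow the search to the cell
anchor = cell.find("p")                   # first <p> inside that cell
sibling = anchor.next_sibling             # node right after </p>

# Guard against the sibling being a tag (or missing) instead of text.
text = sibling.strip() if isinstance(sibling, NavigableString) else None
print(text)
```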

Source

2014-07-18 21:05:58

Could you add more explanatory text about how this answers the question? –

Glad to! (see above) –

soup = BeautifulSoup(html) 
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
    hit = hit.text.strip() 
    print hit

This will print: THIS IS MY TEXT. Try this.
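
It may be worth checking what .text actually returns for the question's HTML before relying on this. The sketch below assumes Python 3, bs4, and the "html.parser" backend, and uses get_text with a separator so the pieces are easy to tell apart; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The HTML from the question, reproduced so the sketch is self-contained.
HTML = """<html>
<body>
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a hef="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     </br>
    </td>
</table>
</body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")
hit = soup.find(attrs={"class": "MYCLASS"})

# get_text walks all descendants, so nested tag text is included as well.
all_text = hit.get_text(" ", strip=True)
print(all_text)
```

In this tree the result contains the link text and both paragraphs alongside the wanted string, which matches what the question describes, so stripping whitespace alone does not appear to be enough.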

Source

2018-01-24 10:17:22 Naiswita
