Extracting text between HTML comments with BeautifulSoup

Using Python 3 and BeautifulSoup 4, I would like to extract text from an HTML page where the only thing identifying that text is the comment directly above it. For example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

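I have found various ways to extract the text or the comments on a page, but nothing that does what I am looking for. Any help would be greatly appreciated.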


2016-01-08 LANshark

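You just need to iterate over all the comments, check whether each one is one of those you need, and then print the text of the element that follows it, like this: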

from bs4 import BeautifulSoup, Comment 

html = """ 
<html> 
<body> 
<p>p tag text</p> 
<!--UNIQUE COMMENT--> 
I would like to get this text 
<!--SECOND UNIQUE COMMENT--> 
I would also like to find this text 
</body> 
</html> 
""" 
soup = BeautifulSoup(html, 'lxml') 

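# Comment subclasses str, so it can be compared with plain strings directly.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # Python 3: print is a function.
        print(comment.next_element.strip())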

This displays:

I would like to get this text 
I would also like to find this text
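One caveat with this approach: `next_element` is only a string when plain text directly follows the comment. If a tag or another comment follows instead, the `.strip()` call will not give you what you want. Below is a minimal sketch of a guarded variant that collects the results into a dict. It assumes the built-in `html.parser` rather than `lxml` (so nothing extra needs to be installed), and the name `text_by_comment` is just a placeholder for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""

# Assumption: html.parser (bundled with Python) is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")

# Hypothetical helper dict: comment text -> stripped text that follows it.
text_by_comment = {}
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    following = comment.next_sibling
    # Keep only plain text nodes; skip tags, other comments, or a missing sibling.
    if isinstance(following, NavigableString) and not isinstance(following, Comment):
        text_by_comment[str(comment)] = following.strip()

print(text_by_comment)
```

Looking a piece of text up by its comment is then just `text_by_comment['UNIQUE COMMENT']`.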


2016-01-08 10:22:20

I was just about to post this. +1 –

Exactly what I needed. Thank you very much. – LANshark

Python's bs4 module has a Comment class. You can use it to extract the comments.

from bs4 import BeautifulSoup, Comment 

html = """ 
<html> 
<body> 
<p>p tag text</p> 
<!--UNIQUE COMMENT--> 
I would like to get this text 
<!--SECOND UNIQUE COMMENT--> 
I would also like to find this text 
</body> 
</html> 
""" 
soup = BeautifulSoup(html, 'lxml') 
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

This gives you the comment elements.

['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']


2016-01-08 10:00:46

I think the OP is trying to extract the text between the comments, not the comments themselves. –

'I would like to get this text' - is that the one? –

Yes, that one. I am able to extract the comments just fine. – LANshark

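An improvement on Martin's answer: you can search for the specific comments directly. There is no need to iterate over all the comments and then check their values; do it in one go: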

comments_to_search_for = {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'} 
for comment in soup.find_all(string=lambda text: isinstance(text, Comment) and text in comments_to_search_for): 
    print(comment.next_element.strip())

Prints:

I would like to get this text 
I would also like to find this text
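If the content between two comments can include tags as well as bare text, `next_element` only returns the first node. A rough sketch of one way to handle that is to walk `next_siblings` until the next comment and join whatever text is found. The helper name `text_until_next_comment`, the sample HTML with a `<b>` tag, and the use of `html.parser` are all assumptions made for illustration, not part of the answers above.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def text_until_next_comment(comment):
    """Collect the text of every sibling after `comment`, stopping at the next comment."""
    parts = []
    for node in comment.next_siblings:
        if isinstance(node, Comment):
            break  # the next comment marks the end of this block
        if isinstance(node, NavigableString):
            parts.append(str(node))
        else:
            parts.append(node.get_text())  # a tag: take its inner text
    # Collapse runs of whitespace and newlines into single spaces.
    return " ".join("".join(parts).split())


# Hypothetical sample: the first block mixes bare text and a tag.
html = """
<body>
<!--UNIQUE COMMENT-->
First line <b>with bold</b> text
<!--SECOND UNIQUE COMMENT-->
Second block
</body>
"""

soup = BeautifulSoup(html, "html.parser")
wanted = {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"}

results = {
    str(c): text_until_next_comment(c)
    for c in soup.find_all(string=lambda s: isinstance(s, Comment) and str(s) in wanted)
}

for label, text in results.items():
    print(label, "->", text)
```

This only looks at siblings, so it assumes both comments sit at the same level of the tree; comments nested in different parents would need a different traversal.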


2016-01-08 16:27:43 alecxe
