2016-01-08 111 views
4

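Extracting text between HTML comments with BeautifulSoup

Using Python 3 and BeautifulSoup 4, I would like to extract text from an HTML page that can only be identified by the comment above it. For example: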

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

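I have found various ways to extract the text or the comments on a page, but no way to do what I am looking for. Any help would be greatly appreciated.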

Answers

4

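You just need to iterate over all of the comments in the document, check whether each one is a comment you are interested in, and then print the text of the element that follows it, like this: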

from bs4 import BeautifulSoup, Comment 

html = """ 
<html> 
<body> 
<p>p tag text</p> 
<!--UNIQUE COMMENT--> 
I would like to get this text 
<!--SECOND UNIQUE COMMENT--> 
I would also like to find this text 
</body> 
</html> 
""" 
soup = BeautifulSoup(html, 'lxml') 

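# Find every comment node, then keep only the ones we care about.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # next_element is the text node that directly follows the comment.
        print(comment.next_element.strip())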

This prints:

I would like to get this text 
I would also like to find this text 
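
The loop above relies on the wanted text being a single string node directly after the comment. As a minimal sketch, assuming the text may also contain inline tags, the following walks the comment's siblings until the next comment is reached. The helper name `text_after_comment`, the `<b>` tag in the sample markup, and the use of the built-in `html.parser` are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = """
<html>
<body>
<!--UNIQUE COMMENT-->
I would like <b>to get</b> this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""


def text_after_comment(comment):
    """Hypothetical helper: join the text of all siblings up to the next comment."""
    parts = []
    for sibling in comment.next_siblings:
        if isinstance(sibling, Comment):
            break  # the next comment marks the end of this section
        if isinstance(sibling, NavigableString):
            parts.append(str(sibling))
        else:
            parts.append(sibling.get_text())  # a tag: take its inner text
    # Collapse runs of whitespace and newlines into single spaces.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(html, "html.parser")
wanted = {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"}

results = {}
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    if str(comment) in wanted:
        results[str(comment)] = text_after_comment(comment)

for name, text in results.items():
    print(name, "->", text)
```

Here `results` maps each comment's text to the text that follows it, so the sentence after the first comment is recovered whole even though part of it sits inside a `<b>` tag.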

I was just about to post the same thing. +1 –

Exactly what I needed. Thank you very much. – LANshark

1

Python's bs4 module has a Comment class. You can use it to extract the comments.

from bs4 import BeautifulSoup, Comment 

html = """ 
<html> 
<body> 
<p>p tag text</p> 
<!--UNIQUE COMMENT--> 
I would like to get this text 
<!--SECOND UNIQUE COMMENT--> 
I would also like to find this text 
</body> 
</html> 
""" 
soup = BeautifulSoup(html, 'lxml') 
# Collect every comment node in the document.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(comments)

This gives you the comment elements:

['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']

I think the OP is trying to extract the text between the comments, not the comments themselves. –

'I would like to get this text' - is that the one? –

Yes, that one. I am able to extract the comments just fine. – LANshark

2

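An improvement on Martin's answer: you can search for the specific comments directly. There is no need to iterate over all of the comments and then check their values; it can be done in one go: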

comments_to_search_for = {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'} 
for comment in soup.find_all(string=lambda text: isinstance(text, Comment) and text in comments_to_search_for):
    print(comment.next_element.strip()) 

This prints:

I would like to get this text 
I would also like to find this text 
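
The exact-match filter above only finds a comment when its text is identical to an entry in the set. As a rough sketch, assuming the page might write comments with padding such as `<!-- UNIQUE COMMENT -->`, the predicate below strips whitespace before comparing, and also skips a match whose next element is not plain text. The sample markup, the `OTHER` comment, and the name `is_wanted_comment` are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Illustrative markup: the first comment is padded with spaces.
html = (
    "<body><!-- UNIQUE COMMENT -->first"
    "<!--OTHER-->skip"
    "<!--SECOND UNIQUE COMMENT-->second</body>"
)

comments_to_search_for = {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"}


def is_wanted_comment(node):
    """Hypothetical predicate: a comment whose trimmed text is in the wanted set."""
    return isinstance(node, Comment) and node.strip() in comments_to_search_for


soup = BeautifulSoup(html, "html.parser")

found = []
for comment in soup.find_all(string=is_wanted_comment):
    following = comment.next_element
    # next_element can be a tag, another comment, or None at the end of the
    # document, so only plain text nodes are accepted here.
    if isinstance(following, NavigableString) and not isinstance(following, Comment):
        found.append(following.strip())

print(found)
```

With this sample, `found` holds the two strings that follow the wanted comments, and the text after the `OTHER` comment is left out.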