Getting the content of a web page's body with Python

I am trying to scrape various websites with Python. The following code works for me:

import urllib 
import re 
htmlfile =urllib.urlopen("http://google.com") 
htmltext=htmlfile.read() 
regex='<title>(.+?)</title>' 
pattern=re.compile(regex) 
title= re.findall(pattern,htmltext) 
print title

To get the body content instead, I changed it as follows:

import urllib 
import re 
htmlfile =urllib.urlopen("http://google.com") 
htmltext=htmlfile.read() 
regex='<body>(.+?)</body>' 
pattern=re.compile(regex) 
title= re.findall(pattern,htmltext) 
print title

The code above gives me an empty list ([]). I don't know what I am doing wrong. Please help.

2014-03-05 user2923505

Trying to parse HTML with regular expressions is generally a bad idea.

The excellent Beautiful Soup library makes what you want to do trivial.

import bs4

html = '''
<head>
</head>
<body>
    <div></div>
</body>
'''

# Name the parser explicitly so the result does not depend on what is installed.
print(bs4.BeautifulSoup(html, 'html.parser').find('body'))

Python also has an HTML parser in its standard library, which is basically a less feature-rich version of what Beautiful Soup offers.
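As a rough illustration of the standard-library route, here is a minimal sketch built on html.parser.HTMLParser. The class name BodyTextExtractor, the helper body_text, and the sample markup are made up for this example. It only collects visible text inside the body and does nothing special for script or style blocks, so treat it as a starting point rather than a replacement for Beautiful Soup.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects the text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BODY bgcolor="..."> matches too.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-blank text found inside the body.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def body_text(html):
    """Hypothetical helper: return the body's text joined by spaces."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page; note the attribute on the body tag.
sample = """
<html>
<head><title>Example</title></head>
<body bgcolor="#fff">
    <div>Hello</div>
    <p>world</p>
</body>
</html>
"""

print(body_text(sample))
```

Because the parser reports tag names rather than raw text, a body tag that carries attributes is handled the same way as a bare one.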

If you still insist on using regular expressions, this should work.

import re 
print(re.findall('<body>(.*?)</body>', html, re.DOTALL))

This may sound silly, but also make sure there actually are body tags in the htmltext string.
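One way to run that check, and to make the pattern a little more forgiving, is sketched below. The names BODY_RE and extract_body and the sample string are invented for illustration. The sketch assumes htmltext is a str (in Python 3 the bytes returned by read() would need decoding first), and it is still a regex, so it will break on unusual markup such as a ">" inside an attribute value.

```python
import re

# Hypothetical pattern: tolerates attributes on <body ...>, any letter case,
# and newlines inside the body (re.DOTALL lets "." match "\n").
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.DOTALL | re.IGNORECASE)


def extract_body(htmltext):
    """Return the inner HTML of the first body element, or None if absent."""
    match = BODY_RE.search(htmltext)
    return match.group(1) if match else None


# Made-up sample: the opening tag is uppercase and has an attribute.
sample = '<html><BODY class="home">\n  <div>Hello</div>\n</body></html>'

# The literal string "<body>" never occurs, so the question's pattern finds nothing.
print("<body>" in sample)
print(re.findall("<body>(.+?)</body>", sample))

# The more tolerant pattern returns the inner content without the body tags.
print(repr(extract_body(sample)))
```

If the first check prints False for a real page, the opening tag most likely carries attributes or different casing, which is enough to make the original pattern return an empty list.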

2014-03-05 06:19:55 rectangletangle

Thanks for the idea. I don't know much about Beautiful Soup, but your suggestion is great. – user2923505

Works great. But how do you get rid of the '<body>' and '</body>' tags? – clemlaflemme

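To answer the question: if you actually look through htmltext, you won't find the two body tags your pattern expects. But I would definitely recommend taking the Beautiful Soup route that @rectangletangle mentioned.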

2014-03-05 06:26:03 ForgetfulFellow
