BeautifulSoup is missing nodes

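I'm using Python and BeautifulSoup to parse HTML and extract p tags from RSS feeds. However, some URLs cause problems because the parsed soup object doesn't contain all the nodes of the document.

For example, I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm

But after comparing the parsed object with the page source, I noticed that all nodes after ul class="nextgen-left" are missing.

This is how I parse the document: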

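# Python 2 (cookielib and urllib2 are Python 2 modules)
import cookielib
import urllib2

from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'

# Build an opener that keeps cookies across redirects
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)

response = opener.open(request)

# Parse the response with the lxml parser and dump the resulting tree
soup = bs(response, 'lxml')
print(soup)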

2013-05-01 Martin Golpashin

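Try a different parser; the HTML in the feed is broken, and different parsers handle that differently. – 2013-05-01 10:56:49

The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib parser handles this page correctly: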

>>> import requests 
>>> from bs4 import BeautifulSoup 
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm') 
>>> soup = BeautifulSoup(r.text, 'lxml') 
>>> soup.find('div', id='story-body') is not None 
False 
>>> soup = BeautifulSoup(r.text, 'html5') 
>>> soup.find('div', id='story-body') is not None 
True
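
If you have to deal with many feeds of varying quality, the same check can be wrapped in a small helper that tries several parsers in turn. This is only a rough sketch: the function name `parse_until_found` and the default parser order are my own assumptions, not part of BeautifulSoup, and the lxml and html5lib parsers are skipped if they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_until_found(html, tag, attrs,
                      parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (parser_name, soup) for the first parser
    whose tree contains the requested node, or (None, None) if none does."""
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
        if soup.find(tag, attrs=attrs) is not None:
            return name, soup
    return None, None
```

With the page above you would call it as `parse_until_found(r.text, 'div', {'id': 'story-body'})` and then use the returned soup as usual. Which parser ends up being used depends on what is installed and on how each one recovers from the broken markup, so it is worth logging the returned name.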

2013-05-01 11:09:17
