Beautiful Soup: getting a warning, then an error midway through the code

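I'm looping through the Wikipedia page for each date of the year (January 1, January 2, ..., December 31). On each page, I pull out the names of the people whose birthday falls on that day. However, partway through the run (at April 27), I get this warning: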

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

Then I get this error:

Traceback (most recent call last):
  File "wikipedia.py", line 29, in <module>
    section = soup.find('span', id='Births').parent
AttributeError: 'NoneType' object has no attribute 'parent'

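Basically, I can't figure out why, after running fine all the way up to April 27, it decides to throw this warning and error. This is the April 27 page: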

April 27...

From what I can tell, nothing about it is different in a way that would cause this. There is still a span with id="Births".

Here is my code, where I call all of this:

site = "http://en.wikipedia.org/wiki/" + a + "_" + str(b)
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

section = soup.find('span', id='Births').parent
births = section.find_next('ul').find_all('li')

for x in births:
    # All the regex and parsing, don't think it's necessary to show

The error is thrown at the line that reads:

section = soup.find('span', id='Births').parent

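By the time I reach April 27 I have collected a lot of information (8 lists of about 35,000 elements each), but I don't think that would be the problem. If anyone has any ideas, I'd appreciate it. Thanks.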

Asked 2013-07-16 by Alex Chumbley

It looks like the Wikipedia server is serving the page gzip-compressed:

>>> page.info().get('Content-Encoding') 
'gzip'

It shouldn't do that without an Accept-Encoding header in the request, but, eh, that's life when working with other people's servers.
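That would also explain both symptoms: if Beautiful Soup is handed compressed bytes rather than HTML, it cannot decode them, and there is no span with id='Births' for it to find. Below is a minimal sketch for checking this directly on the raw response body. The helper name looks_gzipped is made up for illustration; the check relies on the fact that a gzip stream starts with the two bytes 0x1f 0x8b.

```python
GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_gzipped(body):
    """Return True if the raw response body starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC
```

Calling it on the result of page.read() for the failing date should tell you whether the body is compressed, independently of what the Content-Encoding header says.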

There are plenty of resources out there showing how to work with gzip-compressed data. Here's one: http://www.diveintopython.net/http_web_services/gzip_compression.html

And here's another: Does python urllib2 automatically uncompress gzip data fetched from webpage?
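As a rough sketch of what those resources describe, assuming the whole body fits in memory: read the raw bytes, and decompress them only when the server labels the response as gzip. The helper name maybe_gunzip is hypothetical, not part of urllib2 or Beautiful Soup.

```python
import gzip
import io


def maybe_gunzip(body, content_encoding):
    """Return body decompressed if it is labelled gzip, otherwise unchanged."""
    if content_encoding and content_encoding.lower() == 'gzip':
        # GzipFile expects a file-like object, so wrap the raw bytes.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

With the code from the question, this would mean passing page.read() and page.info().get('Content-Encoding') to the helper and handing the result to BeautifulSoup instead of page. It is probably also worth checking that soup.find('span', id='Births') did not return None before accessing .parent, so that a bad page fails with a clearer message.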

Answered 2013-07-16 22:41:11

Thanks, I'll definitely check this out.
