2014-02-25 50 views
2

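Python BeautifulSoup scraping a web page: the same code works on and off

I use the same code to fetch the text of a web page, but most of the time it prints "WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER." Surprisingly, it sometimes works: for example, I ran the code 12 times and it succeeded once.

Same code, same URL. Why does this happen?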

from bs4 import BeautifulSoup 
import re 
import urllib2 


url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age" 
page = urllib2.urlopen(url) 
soup = BeautifulSoup(page.read()) 

web_p = soup.find_all('span',class_='url') 

for web in web_p: 
    print web 

The traceback details are as follows:

Traceback (most recent call last):
  File "C:\Python27\lib\idlelib\run.py", line 112, in main
    seq, request = rpc.request_queue.get(block=True, timeout=0.05)
  File "C:\Python27\lib\Queue.py", line 176, in get
    raise Empty
Empty
+0

Please post the traceback that appears when the error is raised. – tsroten

+1

Possible duplicate of [Beautiful Soup gets warning and then error halfway through code](http://stackoverflow.com/questions/17688063/beautiful-soup-gets-warning-and-then-error-halfway-through-code) – isedev

Answer

2

Thanks to isedev for the pointer. Following the answer to "Does python urllib2 automatically uncompress gzip data fetched from webpage?", changing the code to the following works:

from StringIO import StringIO
import gzip
from bs4 import BeautifulSoup
import re
import urllib2


request = urllib2.Request('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')
# Ask for gzip explicitly so the response encoding is predictable.
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
# Always read the body, so that data is defined even for uncompressed responses.
data = response.read()
# urllib2 does not decompress for you; do it only when the server used gzip.
if response.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO(data)).read()

soup = BeautifulSoup(data)

web_p = soup.find_all('span', class_='url')

for web in web_p:
    print web
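
For readers on Python 3, the conditional decompression step above can be sketched with the standard library alone. This is only an illustration, not the code from the thread: the helper name `maybe_gunzip` and the sample HTML are hypothetical, and checking the gzip magic bytes in addition to the `Content-Encoding` header is an extra safeguard assumed here, not something the original answer does.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body, content_encoding=None):
    """Return body decompressed if it is gzip-compressed, else unchanged.

    body is the raw response bytes; content_encoding is the value of the
    Content-Encoding response header, or None if the header is absent.
    """
    header_says_gzip = (content_encoding or "").strip().lower() == "gzip"
    if header_says_gzip or body.startswith(GZIP_MAGIC):
        # Raises an error if the header claims gzip but the body is not gzip.
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    html = b"<span class='url'>example.com</span>"  # hypothetical sample page
    print(maybe_gunzip(gzip.compress(html), "gzip") == html)  # compressed response
    print(maybe_gunzip(html) == html)                         # plain response
```

In a real fetch, `body` would come from `urllib.request.urlopen(request).read()` and `content_encoding` from `response.headers.get('Content-Encoding')`; the result can then be passed to `BeautifulSoup` as before.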


Thanks to Blender's pointer, the code can be simplified:

from bs4 import BeautifulSoup 
import requests 

html = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age').text 
soup = BeautifulSoup(html) 
web_p = soup.find_all('span',class_='url') 
for web in web_p: 
    print web 
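
As a rough illustration of why the decompression matters, the Python 3 sketch below decodes the same gzip-compressed bytes with and without decompressing them first. It assumes the intermittent warning came from the server sometimes sending a gzip-compressed body, which is what this thread suggests but does not prove. The function names and the sample page are hypothetical.

```python
import gzip

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_without_gunzip(raw):
    # Treat compressed bytes as if they were UTF-8 text (the failing case).
    return raw.decode("utf-8", errors="replace")


def decode_after_gunzip(raw):
    # Decompress first, then decode the resulting bytes as text.
    return gzip.decompress(raw).decode("utf-8", errors="replace")


page = "<span class='url'>example.com</span>"   # hypothetical sample page
wire_bytes = gzip.compress(page.encode("utf-8"))  # what a gzip response carries

garbled = decode_without_gunzip(wire_bytes)
clean = decode_after_gunzip(wire_bytes)

print(REPLACEMENT in garbled)  # compressed bytes are not valid UTF-8
print(clean == page)           # decompressing first recovers the page
```

The second byte of any gzip stream (0x8b) is not a valid start of a UTF-8 sequence, so decoding a compressed body as text always produces at least one replacement character.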
+0

You can use 'requests' and let it decompress the response for you. – Blender

+0

Thanks, Blender. Can you tell me more about this? –

+0

There's not much more to it. You just import it and use it: 'html = requests.get(url).text' – Blender