Python BeautifulSoup fetching a web page: the same code works on and off

I use the same code to fetch text from the web, but most of the time it prints "WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.", and surprisingly it sometimes works: for example, out of 12 runs of the code, 1 succeeded.

Same code, same URL. Why does this happen?

from bs4 import BeautifulSoup 
import re 
import urllib2 


url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age" 
page = urllib2.urlopen(url) 
soup = BeautifulSoup(page.read()) 

web_p = soup.find_all('span',class_='url') 

for web in web_p: 
    print web 

The traceback details are as follows:

Traceback (most recent call last):
  File "C:\Python27\lib\idlelib\run.py", line 112, in main
    seq, request = rpc.request_queue.get(block=True, timeout=0.05)
  File "C:\Python27\lib\Queue.py", line 176, in get
    raise Empty
Empty

Post the traceback that appears when the error is raised. – tsroten


Possible duplicate of [Beautiful Soup gets warning and then error halfway through code](http://stackoverflow.com/questions/17688063/beautiful-soup-gets-warning-and-then-error-halfway-through-code) – isedev

Answer


Thanks to isedev for the pointer to the answer in Does python urllib2 automatically uncompress gzip data fetched from webpage?. Since urllib2 does not decompress gzip-encoded responses on its own, the compressed bytes were being handed straight to BeautifulSoup. Changing the code to the following works:

from StringIO import StringIO
import gzip
from bs4 import BeautifulSoup
import re
import urllib2


request = urllib2.Request('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')
# ask for gzip explicitly so we know what encoding to expect back
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    # urllib2 hands back the raw compressed bytes, so decompress them here
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    # the server did not compress the body, use it as-is
    data = response.read()

soup = BeautifulSoup(data) 

web_p = soup.find_all('span',class_='url') 

for web in web_p: 
    print web 


Thanks to Blender's guidance, the code can be simplified, because requests decompresses gzip responses automatically:

from bs4 import BeautifulSoup 
import requests 

html = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age').text 
soup = BeautifulSoup(html) 
web_p = soup.find_all('span',class_='url') 
for web in web_p: 
    print web 
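
For anyone wondering why this works: requests (via urllib3) undoes the Content-Encoding before exposing the body, which is exactly the step the urllib2 version had to do by hand. A quick check (just a sketch of my own, assuming the Yahoo server still answers with a gzip-encoded body):

import requests

url = 'http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age'
r = requests.get(url)

# The response headers still report the compression the server used ...
print r.headers.get('Content-Encoding')   # e.g. 'gzip' when the body was compressed

# ... but r.text (and r.content) are already decompressed and decoded,
# so they can go straight into BeautifulSoup.
print r.encoding
print len(r.text)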

You can use 'requests' and let it decompress the response for you. – Blender


Thanks, Blender. Can you tell me more about this? –


There isn't much more to it. You just import it and use it: 'html = requests.get(url).text' – Blender