2012-09-15 73 views
0

UnicodeDecodeError in Python

I'm trying to write a Python Google API and I'm running into some Unicode problems. My very basic PoC so far is:

#!/usr/bin/env python 
import urllib2 
from bs4 import BeautifulSoup   
query = "filetype%3Apdf" 
url = "http://www.google.com/search?sclient=psy-ab&hl=en&site=&source=hp&q="+query+"&btnG=Search" 
opener = urllib2.build_opener() 
opener.addheaders = [('User-agent', 'Mozilla/5.0')] 
response = opener.open(url) 
data = response.read() 
data = data.decode('UTF-8', 'ignore') 
data = data.encode('UTF-8', 'ignore') 
soup = BeautifulSoup(data) 
print u""+soup.prettify('UTF-8') 

My traceback is:

Traceback (most recent call last):
  File "./google.py", line 22, in <module>
    print u""+soup.prettify('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 48786: ordinal not in range(128)

Any ideas?

Answer

4

You are converting your soup tree to UTF-8 (an encoded byte string) and then concatenating the result onto an empty u'' unicode string.

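When a byte string and a unicode string are combined, Python 2 automatically tries to decode the byte string using the default encoding, which is ASCII. That decode fails on UTF-8 data containing any byte outside the 0-127 range, such as the 0xc2 in your traceback.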
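To make the hidden step visible, here is a minimal sketch that spells out the ASCII decode that Python 2 performs implicitly during the concatenation. The helper name `implicit_concat` and the sample string are made up for illustration; the sample uses a non-breaking space (U+00A0) only because its UTF-8 encoding starts with 0xc2, which is one possible source of that byte, not necessarily the one in the actual page.

```python
# -*- coding: utf-8 -*-
# U+00A0 (non-breaking space) encodes to the two bytes 0xC2 0xA0 in UTF-8.
original = u"a\u00a0b"
encoded = original.encode("utf-8")


def implicit_concat(prefix, raw_bytes):
    """Hypothetical helper: mimic what Python 2 does for unicode + str.

    The byte string is decoded with the ASCII codec before the
    concatenation takes place.
    """
    return prefix + raw_bytes.decode("ascii")


try:
    implicit_concat(u"", encoded)
    error_message = None
except UnicodeDecodeError as exc:
    # Same class of error as in the traceback from the question.
    error_message = str(exc)

# Decoding with the codec that was used for encoding succeeds.
decoded = u"" + encoded.decode("utf-8")

print(error_message)
print(decoded == original)
```

The explicit `.decode("ascii")` fails in the same way as the implicit one, while decoding with UTF-8 round-trips the text.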

You need to explicitly decode the prettify() output:

print u"" + soup.prettify('UTF-8').decode('UTF-8') 
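A more general way to avoid this class of error is to decode bytes once when they enter the program, work with unicode text internally, and encode once when writing output. The sketch below uses only the standard library; `to_text` is a hypothetical helper, and `page_bytes` is a stand-in for what `response.read()` would return, assuming the page really is UTF-8.

```python
# -*- coding: utf-8 -*-
import io


def to_text(raw, encoding="utf-8"):
    """Hypothetical helper: return unicode text, decoding only if given bytes."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Stand-in for response.read(); assumed to be UTF-8 encoded.
page_bytes = u"<p>caf\u00e9</p>".encode("utf-8")

# Decode once at the input boundary.
text = to_text(page_bytes)

# Both operands are unicode text, so no implicit ASCII decode is involved.
report = u"" + text

# Encode once at the output boundary (here an in-memory byte buffer).
buffer = io.BytesIO()
buffer.write(report.encode("utf-8"))
```

With this arrangement, byte strings and unicode strings never meet in the middle of the program, so the default ASCII codec is never consulted.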

The Python Unicode HOWTO explains this better, including the details of the default encoding. I also really recommend reading Joel Spolsky's article on Unicode.