Getting unicode from a urllib request

2011-08-12

I am running the code below, trying to find specific information in some HTML. However, I have an encoding/decoding problem that I cannot resolve.

import urllib 
req = urllib.urlopen('http://securities.stanford.edu/1046/AAI00_01/') 
html = req.read() 
type(html) 
# <type 'str'> 
html.upper().find('HTML') 
# -1 
print html[0:20] 
# ??<HTML><HE 
html[0:10] 
# '\xff\xfe<\x00H\x00T\x00M\x00' 
req.headers['content-type'] 
# 'text/html' 
html = html.encode('utf-8') 
# Traceback (most recent call last): 
# File "<stdin>", line 1, in <module> 
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

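What is the solution to this problem? All I need to do is use .find and regular expressions to pull some information from the page.

I am on Mac OS X, running Python 2.6.1 from the terminal.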

2011-08-12 ChrisP

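Answer

If you want to convert a str to unicode, use html.decode rather than encode. In Python 2, calling encode on a byte str first decodes it implicitly with the ASCII codec, which is what raises the UnicodeDecodeError above.

Older, bad advice: Also, since you appear to have a BOM at the start, you may want to use 'utf_8_sig' as the encoding, which strips the BOM when decoding.

Newer, better advice: Actually, seeing all those \x00 bytes in the output together with the BOM, the encoding looks more like UTF-16 than UTF-8. So html.decode('utf-16') should be the way to go.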
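
As a minimal sketch of the advice above, the snippet below decodes a byte string that starts with a UTF-16 BOM and then searches the resulting text. The helper name decode_html and the sample bytes are hypothetical (the sample extends the first bytes shown in the question so that it spells a full tag); the sketch only checks for UTF-16 and UTF-8 BOMs and does not handle UTF-32 or pages that declare their encoding elsewhere.

```python
import codecs
import re


def decode_html(raw):
    """Hypothetical helper: decode raw bytes, using a leading BOM as the hint."""
    # A UTF-16 BOM (either byte order) means the 'utf-16' codec can decode the
    # data; that codec reads the BOM to pick the byte order and then drops it.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode('utf-16')
    # A UTF-8 BOM is stripped by the 'utf-8-sig' codec.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode('utf-8-sig')
    # Assumption for this sketch: anything without a BOM is treated as UTF-8.
    return raw.decode('utf-8')


# Assumed sample data, standing in for the result of req.read().
raw = b'\xff\xfe<\x00H\x00T\x00M\x00L\x00>\x00'

text = decode_html(raw)
position = text.upper().find('HTML')   # 1
match = re.search(r'<(\w+)>', text)    # match.group(1) is 'HTML'
```

Once the bytes are decoded, .find and regular expressions work on the text as expected, because the interleaved \x00 bytes are no longer present.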

2011-08-12 22:29:50
