Decoding content from an httplib GET

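I'm fetching simple plain text over HTTP that is encoded in CP-1250 (I can't influence that), and I would like to decode it, process it line by line, and eventually save it as UTF-8.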

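The first part is what is causing me problems. After I get the raw data using response.read(), I pass it to a reader created by getreader("cp1250") from the codecs library. I expect to get a StreamReader instance and simply call readlines to get a list of lines.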

import codecs 
import httplib 

# nothing unusual 
conn = httplib.HTTPConnection('server') 
conn.request('GET', '/') 
response = conn.getresponse() 
content = response.read() 

# the painful part 
sr = codecs.getreader("cp1250")(content) 
lines = sr.readlines()  # d'oh!

But when I call readlines, all I get is an error from somewhere deep inside the codecs library:

[...snip...] 
    File "./download", line 123, in _parse 
    lines = sr.readlines() 
    File "/usr/lib/python2.7/codecs.py", line 588, in readlines 
    data = self.read() 
    File "/usr/lib/python2.7/codecs.py", line 471, in read 
    newdata = self.stream.read() 
AttributeError: 'str' object has no attribute 'read'

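I used print to confirm that sr is an instance of StreamReader; what confuses me is that the object seems to initialize fine but then cannot execute readlines... What is missing here?

Or is this the library's cryptic way of telling me that the data is corrupted (not CP-1250)?

Edit: As jorispilot suggested, unicode(content, encoding="cp1250") works, so I will probably stick with that solution. However, I would still like to know what I did wrong with the codecs library.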

2013-08-29 Alois Mahdal

Have you tried unicode(content, encoding="cp1250") instead of the codecs module? (Or str(content, encoding="cp1250") in Python 3.) – jorispilot

Also, the unicode function will tell you clearly if the data is not valid CP-1250. – jorispilot
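To illustrate the comment above, here is a minimal sketch of what a strict decode reports. The byte strings are made-up samples standing in for response.read(), and bytes.decode is used because it behaves the same way as unicode(...) in Python 2 and str(...) in Python 3. Note that only the few byte values unassigned in CP-1250 (such as 0x81) raise an error; text in some other single-byte encoding will usually decode silently into the wrong characters.

```python
# Hypothetical sample data; in the question this would come from response.read().
good = b"\x9elu\x9dou\xe8k\xfd k\xf9\xf2\n"  # Czech text with diacritics, CP-1250 encoded
bad = b"abc\x81def"                          # 0x81 is not assigned in CP-1250

# Valid input decodes to a unicode string.
text = good.decode("cp1250")

# Invalid input raises UnicodeDecodeError instead of failing obscurely.
try:
    bad.decode("cp1250")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```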

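According to http://docs.python.org/2/library/codecs.html, getreader() returns a StreamReader class (or factory function). It has to be instantiated with a stream, that is, an object that implements read(), not with a string as you are doing.

To fix this, don't read the data from response yourself; pass response directly to the StreamReader instead, as shown below.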

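import codecs
import httplib

conn = httplib.HTTPConnection('server')
conn.request('GET', '/')
response = conn.getresponse()

# wrap the response itself (it has read()), not the string that read() returns
reader = codecs.getreader("cp1250")(response)
lines = reader.readlines()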
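If the body has already been read into a byte string, as in the question, one possible workaround is to wrap it in io.BytesIO so that the StreamReader gets an object with a read() method. This is only a sketch: the content value below is a made-up sample, not real server output.

```python
import codecs
import io

# Hypothetical stand-in for the bytes returned by response.read().
content = b"prvn\xed \xf8\xe1dek\ndruh\xfd \xf8\xe1dek\n"

# io.BytesIO gives the raw bytes a read() method, which StreamReader requires.
reader = codecs.getreader("cp1250")(io.BytesIO(content))

# readlines() returns decoded (unicode) lines, with line endings kept.
lines = reader.readlines()
```

The result is a list of unicode strings, not byte strings, so each line still has to be encoded before it is written out as UTF-8.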

2013-08-29 13:45:05

utf8_lines = []
for line in content.split('\n'):
    # decode each raw line from CP-1250, then re-encode it as UTF-8
    line = line.strip().decode('cp1250')
    utf8_lines.append(line.encode('utf-8'))
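A variation on the same idea, as a hedged sketch that runs on both Python 2 and Python 3: decode the whole body once, process it line by line, and let io.open handle the UTF-8 encoding on write. The function name save_as_utf8, the sample bytes, and the output path are all hypothetical, and stripping trailing whitespace merely stands in for whatever per-line processing is needed.

```python
import io
import os
import tempfile

def save_as_utf8(content, path, source_encoding="cp1250"):
    """Decode raw bytes, process them line by line, and write UTF-8."""
    text = content.decode(source_encoding)
    # newline="\n" keeps the written line endings as plain LF.
    with io.open(path, "w", encoding="utf-8", newline="\n") as out:
        for line in text.splitlines():
            # Placeholder processing: strip trailing whitespace only.
            out.write(line.rstrip() + u"\n")

# Hypothetical sample standing in for response.read().
content = b"prvn\xed \xf8\xe1dek \r\ndruh\xfd \xf8\xe1dek\r\n"
path = os.path.join(tempfile.mkdtemp(), "out.txt")
save_as_utf8(content, path)
```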

2013-08-29 13:37:48 Fabian
