Garbage from urlopen

I want to read some UTF-8 files from the addresses in the code below. It works for most of them, but for some files urllib2 (and urllib) cannot read the content.

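The obvious answer here is that the second file is corrupted, but the strange thing is that IE reads both files without any problem. The code has been tested on XP and Linux, with the same result. Any hints?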

import urllib2

# This works:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/145/pg145.txt")
line = f.readline()
print "this works: %s" % (line)
line = unicode(line, 'utf-8')  # ... works fine

# This doesn't:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
line = f.readline()
print "this doesn't: %s" % (line)
line = unicode(line, 'utf-8')  # ... causes an exception:


2011-11-01 user1023380

The URL you are requesting appears to refer to a private cache. Use http://www.gutenberg.org/files/144/144-0.txt instead (found at http://www.gutenberg.org/ebooks/144).

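If you really want to use the /cache/ URL: the server is sending you gzip-compressed data, not Unicode. urllib2 neither asks for gzipped data nor decodes it automatically, which is the correct behavior. For how to decompress it, see this question.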
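A minimal sketch of the decompression step, assuming the body really is gzip data. The helper name `decode_body` and its parameters are illustrative and not part of urllib2; it relies only on the standard `gzip` and `io` modules. When no `Content-Encoding` value is passed in, it falls back to checking the two gzip magic bytes at the start of the data.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the response body as text, gunzipping it first if needed.

    raw              -- the bytes returned by f.read()
    content_encoding -- value of the Content-Encoding header, if known
    charset          -- text encoding of the uncompressed data (assumed UTF-8)
    """
    gzipped = (content_encoding or "").lower() == "gzip" or raw[:2] == GZIP_MAGIC
    if gzipped:
        # Wrap the bytes in a file-like object so GzipFile can read them.
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(charset)


# Usage with the code from the question (Python 2, needs network access):
#   f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
#   text = decode_body(f.read(), f.headers.get("Content-Encoding"))
```

Note that this reads the whole response before decoding, so `readline()` on the raw response should be replaced by splitting the returned text.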


2011-11-01 10:06:35

Thank you very much, and thanks for the link! – user1023380


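This isn't a solution, but you should take a look at the Requests library (http://pypi.python.org/pypi/requests). Even if you still want to use urllib, you can read the Requests source code to see how it works with UTF-8 strings.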


2011-11-01 10:08:58

>>> f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt") 
>>> f.headers.dict 
{'content-length': '304513', ..., 'content-location': 'pg144.txt.utf8.gzip', 'content-encoding': 'gzip', ..., 'content-type': 'text/plain; charset=utf-8'}

Either set a header that prevents the site from sending a gzip-encoded response, or decode the response first.
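The first option could look roughly like the sketch below. The function names `build_identity_request` and `is_gzip_encoded` are made up for illustration. `Accept-Encoding: identity` is the standard HTTP way to ask for an uncompressed body, but a server is not obliged to honor it, and whether this particular cache URL does is an assumption that has not been verified here. That is why the sketch also checks the response header afterwards.

```python
try:
    from urllib2 import Request, urlopen          # Python 2, as in the question
except ImportError:
    from urllib.request import Request, urlopen   # Python 3 equivalent


def build_identity_request(url):
    """Build a request that asks the server for an uncompressed body."""
    return Request(url, headers={"Accept-Encoding": "identity"})


def is_gzip_encoded(response):
    """Return True if the server still labelled the body as gzip-compressed."""
    encoding = response.headers.get("Content-Encoding", "")
    return encoding.lower() == "gzip"


# Usage (needs network access):
#   response = urlopen(build_identity_request(
#       "http://www.gutenberg.org/cache/epub/144/pg144.txt"))
#   if is_gzip_encoded(response):
#       pass  # the header was ignored; decompress before decoding as UTF-8
```

If `is_gzip_encoded` still returns True, the second option (decompressing before calling `unicode(..., 'utf-8')`) remains necessary.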


2011-11-01 10:09:43 pyroscope

Ah... I see. Thank you very much! – user1023380
