2011-11-15 41 views

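Python 2.6 and 3.2 problem with urlopen routines on Windows

Previously, in Python 2.6, I used urllib.urlopen a lot to capture web page content and then post-process the data I received. Now those routines, as well as the new ones I am trying to write for Python 3.2, are running into what seems to be a Windows-only problem (possibly even a Windows 7-only problem).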

Using the following code with Python 3.2.2 (64-bit) on Windows 7...

import urllib.request 

fp = urllib.request.urlopen(URL_string_that_I_use) 

string = fp.read() 
fp.close() 
print(string.decode("utf8")) 

I get the following:

Traceback (most recent call last): 
    File "TATest.py", line 5, in <module> 
    string = fp.read() 
    File "d:\python32\lib\http\client.py", line 489, in read 
    return self._read_chunked(amt) 
    File "d:\python32\lib\http\client.py", line 553, in _read_chunked 
    self._safe_read(2)  # toss the CRLF at the end of the chunk 
    File "d:\python32\lib\http\client.py", line 592, in _safe_read 
    raise IncompleteRead(b''.join(s), amt) 
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected) 

Using the following code instead...

import urllib.request 

fp = urllib.request.urlopen(URL_string_that_I_use) 
for Line in fp: 
    print(Line.decode("utf8").rstrip('\n')) 
fp.close() 

I get a fair amount of the web page's content, but then the rest of the capture is cut short by...

Traceback (most recent call last): 
    File "TATest.py", line 9, in <module> 
    for Line in fp: 
    File "d:\python32\lib\http\client.py", line 489, in read 
    return self._read_chunked(amt) 
    File "d:\python32\lib\http\client.py", line 545, in _read_chunked 
    self._safe_read(2) # toss the CRLF at the end of the chunk 
    File "d:\python32\lib\http\client.py", line 592, in _safe_read 
    raise IncompleteRead(b''.join(s), amt) 
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected) 

Trying to read another page fails with...

Traceback (most recent call last): 
    File "TATest.py", line 11, in <module> 
    print(Line.decode("utf8").rstrip('\n')) 
    File "d:\python32\lib\encodings\cp1252.py", line 19, in encode 
    return codecs.charmap_encode(input,self.errors,encoding_table)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 
21: character maps to <undefined> 

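I believe this is a Windows problem, but can Python be made more robust in dealing with whatever is causing it? When we try similar code (the version 2.6 code) on Linux, we do not run into the problem. Is there a workaround? I have also posted to the gmane.comp.python.devel newsgroup.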

Answers


It looks like the page you are reading is encoded as cp1252.

import urllib.request 

fp = urllib.request.urlopen(URL_string_that_I_use) 

string = fp.read() 
fp.close() 
print(string.decode("cp1252")) 

should work.
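To see why the choice of codec matters here, the sketch below reproduces the error from the last traceback without any network access. The sample bytes are hypothetical, chosen only to produce the same character `'\x92'` named in the error message, and the helper name `console_safe` is an illustration, not something from the question. Note that the traceback passes through `encode`, so the failure may be happening when `print` encodes text for a cp1252 console rather than while the page is being decoded.

```python
# Hypothetical sample: the control character U+0092 encoded as UTF-8.
raw = b"It\xc2\x92s"

text = raw.decode("utf8")          # the decoded text contains '\x92'

try:
    text.encode("cp1252")          # roughly what print() does on a cp1252 console
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# Decoding the same bytes as cp1252 maps byte 0x92 to U+2019 (right single
# quote), which cp1252 can encode again, so printing no longer raises.
as_cp1252 = raw.decode("cp1252")
as_cp1252.encode("cp1252")


def console_safe(text, encoding="cp1252"):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


print(console_safe(text))
```

Decoding UTF-8 bytes as cp1252 avoids the exception but can leave stray characters such as `Â` in the output, so a replace-on-encode helper like the one above may be the gentler option when the page really is UTF-8.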

There are many ways to specify the character set of the content, but using the HTTP header should be enough for most pages:

import urllib.request 

fp = urllib.request.urlopen(URL_string_that_I_use) 

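# get_content_charset() returns None when the header names no charset;
# the UTF-8 fallback used here is an assumption
charset = fp.info().get_content_charset() or "utf8"
string = fp.read().decode(charset)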
fp.close() 
print(string) 

Thanks, Seth. I had not looked at this for a while and only just realized that you had answered. I believe this will be valuable in the future.


@ThomIves You're welcome. If the solution works for you, please mark it as accepted.