Python: encoding error - web page content

I am trying to fetch the content of a web page, parse it, and then save it in a MySQL database.

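I have already done this successfully for a web page encoded in UTF-8.

However, when I try it with a web page encoded in 8859-9, I get an error.

My code to get the content of the page: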

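import urllib2
import chardet  # third-party package, used below to guess the encoding

def getcontent(url):
    opener = urllib2.build_opener()
    # Set both headers in one list; a second assignment would overwrite the first
    opener.addheaders = [('User-agent', 'Magic Browser'),
                         ('Accept-Charset', 'utf-8')]
    # read() returns the raw, still-encoded bytes of the response body
    response = opener.open(url).read()
    #print chardet.detect(response).get('encoding')
    opener.close()
    return response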



url  = "http://www.meb.gov.tr/duyurular/index.asp?ID=4" 
contentofpage = getcontent(url) 
print contentofpage 
print chardet.detect(contentofpage) 
print contentofpage.encode("utf-8")

The output of the page content: ... Eitim Teknolojileri ... (the Turkish characters are missing or garbled)

{'confidence': 0.7789909202570836, 'encoding': 'ISO-8859-2'} 


Traceback (most recent call last):
  File "meb.py", line 18, in <module>
    print contentofpage.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

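Actually, the page is in Turkish and its encoding is ISO-8859-9.

When I try the default encoding, I see wrong symbols instead of some characters. How can I fetch or convert the page content to UTF-8 or Turkish (ISO-8859-9)?

Also, when I use unicode(contentofpage)

I get

Traceback (most recent call last):
  File "meb.py", line 20, in <module>
    print unicode(contentofpage)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

Can anyone help me?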

Source

2013-01-06 MatandDie

I think you want to decode, not encode, since the content is already encoded.

print contentofpage.decode("iso-8859-9")

This produces output such as:

Eğitim Teknolojileri Genel Müdürlüğü
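
To illustrate the direction of the conversion, here is a minimal sketch. It assumes the fetched bytes really are ISO-8859-9; the `raw` byte string is a hypothetical stand-in for what `getcontent(url)` returns, not actual page data. It also hints at why chardet's ISO-8859-2 guess would not be good enough here: the two encodings map byte 0xF0 to different letters. The syntax is chosen so that it should run on both Python 2.7 and Python 3.

```python
# Hypothetical stand-in for the bytes returned by getcontent(url):
# the sample phrase from the answer, encoded as ISO-8859-9.
raw = b"E\xf0itim Teknolojileri Genel M\xfcd\xfcrl\xfc\xf0\xfc"

# bytes -> text: decode with the encoding the page actually uses
text = raw.decode("iso-8859-9")

# Decoding with chardet's guess succeeds too, but yields different letters
wrong = raw.decode("iso-8859-2")

# text -> bytes: encode explicitly when UTF-8 output is needed (e.g. storage)
utf8_bytes = text.encode("utf-8")

print(repr(text))        # repr() keeps the output ASCII-safe on any terminal
print(repr(wrong))
print(repr(utf8_bytes))
```

The UTF-8 bytes in `utf8_bytes` are what would typically be handed to the database layer, provided the connection and the table are also configured for UTF-8.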

Source

2013-01-06 09:06:36 sberry

print contentofpage.decode("iso-8859-9") gives UnicodeEncodeError: 'ascii' codec can't encode character u'\xee' in position 458: ordinal not in range(128) – MatandDie

Make sure you decode directly after fetching the content: 'contentofpage = getcontent(url)', then 'print contentofpage.decode('iso-8859-9')'. –
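
The UnicodeEncodeError reported in the comment most likely comes from the print step rather than from the decode: when the output stream has no usable encoding, Python 2 falls back to ASCII to turn the decoded text back into bytes. One way around it is to encode explicitly when writing. The sketch below assumes UTF-8 output is wanted; `buffer` is a hypothetical in-memory stand-in for a file or terminal byte stream, and the sample bytes are illustrative, not real page data.

```python
import io

# Decoded text, as produced by contentofpage.decode("iso-8859-9")
text = b"E\xf0itim Teknolojileri".decode("iso-8859-9")

# Encoding to ASCII fails for the Turkish letter, which mirrors the
# error from the comment
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Writing through a wrapper with an explicit encoding avoids the fallback
buffer = io.BytesIO()  # stand-in for a real byte stream
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(text)
writer.flush()
written = buffer.getvalue()

print(ascii_ok)
print(repr(written))
```

For a quick check in a script, `print contentofpage.decode("iso-8859-9").encode("utf-8")` follows the same idea on Python 2, assuming the terminal displays UTF-8.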
