Porting from Python 2 to Python 3: 'utf-8 codec can't decode byte'


I am trying to port a small snippet from Python 2 to Python 3. Python 2:

def _download_database(self, url):
    try:
        with closing(urllib.urlopen(url)) as u:
            return StringIO(u.read())
    except IOError:
        self.__show_exception(sys.exc_info())
        return None

Python 3:

def _download_database(self, url):
    try:
        with closing(urllib.request.urlopen(url)) as u:
            response = u.read().decode('utf-8')
            return StringIO(response)
    except IOError:
        self.__show_exception(sys.exc_info())
        return None

But I still get:

utf-8 codec can't decode byte 0x8f in position 12: invalid start byte

I need to use StringIO because the download is a zip file, and I want to parse it with this function:

def _parse_zip(self, raw_zip):
    try:
        zip = zipfile.ZipFile(raw_zip)

        filelist = map(lambda x: x.filename, zip.filelist)
        db_file = 'IpToCountry.csv' if 'IpToCountry.csv' in filelist else filelist[0]

        with closing(StringIO(zip.read(db_file))) as raw_database:
            return_val = self.___parse_database(raw_database)

        if return_val:
            self._load_data()

    except:
        self.__show_exception(sys.exc_info())
        return_val = False

    return return_val

raw_zip is the return value of the _download_database function.


2015-12-15 Fragkiller

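The data you received is clearly *not* encoded as UTF-8. What encoding is it? If the web server is configured correctly, the Content-Type header of the HTTP response should tell you, as should the HTML tags in the document (if it is HTML). – dsh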

The default encoding for many web servers is iso-8859-1. – lavinio

[Here](https://stackoverflow.com/search?q=[python-3]+codec+can%27t+decode+answers%3A1) are questions on Stack Overflow with answers that explain how bytes are interpreted as characters. – dsh

utf-8 cannot decode arbitrary binary data.

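utf-8 is a character encoding. It can be used to encode text (represented in Python 3 as the str type, a sequence of Unicode code points) into a bytestring (the bytes type, a sequence of bytes, i.e. small integers in the interval [0, 255]) and to decode it back.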
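To make the round trip concrete, here is a minimal sketch; the sample string is made up for illustration. It also reproduces the kind of error from the question with a lone 0x8f byte, which is a UTF-8 continuation byte and so cannot start a character.

```python
# Text (str) -> bytes -> text round trip with UTF-8.
text = "naïve café"                  # illustrative sample text
encoded = text.encode("utf-8")       # bytes
decoded = encoded.decode("utf-8")    # str again

# Arbitrary binary data is usually not valid UTF-8.
try:
    b"\x8f".decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(type(encoded).__name__, type(decoded).__name__)
print(error_message)
```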

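utf-8 is not the only character encoding, and some character encodings are incompatible with it. Even if .decode('utf-8') does not raise an exception, that does not mean the result is correct: decoding text with the wrong character encoding can give you mojibake. See "A good way to get the charset/encoding of an HTTP response in Python".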
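As a rough offline illustration of both points: the helper below (charset_from_content_type is a made-up name, not part of any library) reads the charset parameter from a Content-Type header value using the standard email.message.Message class. With a live urlopen() response, the same method should be available as response.headers.get_content_charset(). The last lines show mojibake from decoding with the wrong codec.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Hypothetical helper: extract the charset from a Content-Type value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(failobj=default)


html_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
zip_charset = charset_from_content_type("application/zip")  # no charset given
print(html_charset, zip_charset)

# Decoding with the wrong codec can "succeed" and still be wrong.
mojibake = "é".encode("utf-8").decode("latin-1")
print(mojibake)
```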

Your input is a zip file. That is binary data, not text, so you should not try to decode it as text.
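If in doubt about what a download contains, one option is to inspect the raw bytes instead of decoding them. The sketch below builds a small archive in memory (the member name and row are invented sample data) so that it runs without network access, then checks it with zipfile.is_zipfile().

```python
import io
import zipfile

# Build a tiny zip archive in memory as a stand-in for the downloaded payload.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("IpToCountry.csv", "1,2,ZZ\n")  # invented sample row
payload = buffer.getvalue()

# Zip data is bytes; check it as bytes rather than calling .decode() on it.
looks_like_zip = zipfile.is_zipfile(io.BytesIO(payload))
print(payload[:2], looks_like_zip)
```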

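Python 3 helps you find bugs caused by mixing binary data and text. To port code from Python 2 to Python 3, you should understand the difference between text (Unicode) and binary data (bytes).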

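In Python 2, str is a bytestring that can hold both binary data and (encoded) text. Unless from __future__ import unicode_literals is present, a '' literal creates a bytestring in Python 2, and u'' creates a unicode instance. In Python 3, the str type is Unicode. bytes refers to a sequence of bytes on both Python 3 and Python 2.7 (on Python 2, bytes is an alias for str). b'' creates a bytes instance on both Python 2 and Python 3.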
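A small Python 3 sketch of that distinction, using made-up values: str and bytes are separate types, indexing bytes gives integers, and mixing the two raises an error instead of silently producing garbage.

```python
text = "abc"      # str: a sequence of Unicode code points
data = b"abc"     # bytes: a sequence of integers in range(256)

first_char = text[0]   # a one-character str
first_byte = data[0]   # an int

# Python 3 refuses to mix text and binary data implicitly.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(repr(first_char), first_byte, mixing_allowed)
```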

urllib.request.urlopen(url) returns a file-like object (a binary file). In some cases you can pass it on as is, e.g., to decode remote gzipped content on-the-fly:

#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request

# getelements() and process() are helpers that are not defined in this snippet.
with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)

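ZipFile() requires a seek()-able file, so you can't pass the urlopen() result to it directly. You have to download the content first. You can wrap the downloaded bytes in io.BytesIO():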

#!/usr/bin/env python3
import io
import zipfile
from urllib.request import urlopen

url = "http://www.pythonchallenge.com/pc/def/channel.zip"
with urlopen(url) as r, zipfile.ZipFile(io.BytesIO(r.read())) as archive:
    print({member.filename: archive.read(member) for member in archive.infolist()})
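Applied to the question's use case, the same idea might look like the sketch below. It is not the asker's actual code: make_sample_zip() and read_rows() are hypothetical names, the archive is built in memory so the example runs offline, and latin-1 is only an assumed encoding for the CSV member. The point is that the archive stays bytes throughout and only the extracted member is decoded as text.

```python
import csv
import io
import zipfile


def make_sample_zip():
    """Stand-in for u.read(): return the raw bytes of a small zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("IpToCountry.csv", "16777216,16777471,AU\n")
    return buffer.getvalue()


def read_rows(zip_bytes, encoding="latin-1"):
    """Open the archive from bytes and decode only the CSV member as text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        db_file = "IpToCountry.csv" if "IpToCountry.csv" in names else names[0]
        with archive.open(db_file) as member:
            # Assumed encoding; use whatever the data source documents.
            text = io.TextIOWrapper(member, encoding=encoding, newline="")
            return list(csv.reader(text))


rows = read_rows(make_sample_zip())
print(rows)
```

In the question's code, that would roughly mean returning io.BytesIO(u.read()) from _download_database and wrapping the member in a text layer inside _parse_zip.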

StringIO() is a text file. In Python 3 it stores Unicode.
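A quick way to see the difference, with made-up sample values: io.StringIO accepts only str, while io.BytesIO holds raw bytes unchanged.

```python
import io

text_stream = io.StringIO("some text")   # in-memory text file
byte_stream = io.BytesIO(b"\x8f\x00")    # in-memory binary file

# StringIO rejects bytes outright in Python 3.
try:
    io.StringIO(b"\x8f\x00")
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

text_value = text_stream.read()
byte_value = byte_stream.read()
print(type(text_value).__name__, type(byte_value).__name__, bytes_rejected)
```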


2015-12-17 18:45:41 jfs

Thanks, that's great. – Fragkiller

If all you need is to return a stream object from the function (with no requirement to decode the content), you can use BytesIO instead of StringIO:

from contextlib import closing 
from io import BytesIO 
from urllib.request import urlopen 

url = 'http://www.google.com' 


with closing(urlopen(url)) as u: 
    response = u.read() 
    print(BytesIO(response))


2015-12-17 14:24:08

The link you posted, http://software77.net/geo-ip?DL=2, tries to download a zip file, which is binary.

You should not convert a binary blob to str (just use BytesIO). If you have a good reason to do it anyway, use latin-1 as the decoder.
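The reason latin-1 works as a last resort is that it maps every byte value 0 to 255 to exactly one code point, so the conversion never fails and can be reversed. A short check, using all 256 byte values as sample data:

```python
blob = bytes(range(256))             # every possible byte value

as_text = blob.decode("latin-1")     # never raises
round_tripped = as_text.encode("latin-1")

print(len(as_text), round_tripped == blob)
```

The resulting str is not meaningful text; it is only a lossless container for the bytes.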


2015-12-17 18:07:50
