UnicodeDecodeError when reading dictionary words file with simple Python script

This is my first time doing Python in a while, and I'm having trouble doing a simple scan of a file when I run the following script with Python 3.0.1:

with open("/usr/share/dict/words", 'r') as f:
    for line in f:
        pass

I get this exception:

Traceback (most recent call last): 
    File "/home/matt/install/test.py", line 2, in <module> 
    for line in f: 
    File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__ 
    line = self.readline() 
    File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline 
    while self._read_chunk(): 
    File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk 
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof)) 
    File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode 
    output = self.decoder.decode(input, final=final) 
    File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode 
    (result, consumed) = self._buffer_decode(data, self.errors, final) 
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data

The line in the file that it blows up on is "Argentinian", which doesn't seem unusual in any way.

Update: I added

encoding="iso-8559-1"

to the open() call, and it fixed the problem.
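For reference, here is a minimal sketch of what the update describes: passing an explicit encoding to open(). It assumes the intended name is "iso-8859-1" (Python's codec registry has no "iso-8559-1"). The helper `read_words` and the temporary demo file are hypothetical stand-ins for the real /usr/share/dict/words.

```python
import os
import tempfile


def read_words(path, encoding="iso-8859-1"):
    """Return the lines of a word list, decoded with an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical demo file standing in for /usr/share/dict/words.
# The second word contains the single Latin-1 byte 0xF3 ("ó").
demo_path = os.path.join(tempfile.mkdtemp(), "words")
with open(demo_path, "wb") as f:
    f.write(b"apple\nAsunci\xf3n\n")

words = read_words(demo_path)
print(words)
```

ISO-8859-1 maps every possible byte to a character, so this never raises a decode error. The result is only meaningful if the file really is Latin-1.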


2009-06-19 Matt R

Are you sure you didn't mean 'iso-8859-1'? That seems more common. Also, \xf3 is the "ó" in Asunción in iso-8859-1 (it is code point U+00F3 in Unicode), but in UTF-8 it would be represented as '\xc3\xb3'. – Malvolio 2011-08-02 06:41:36

@Malvolio: Entirely possible I typed the wrong encoding name ;-) – 2011-08-02 10:20:10
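The byte values mentioned in the comment above can be checked directly. This is a small sketch rather than anything from the original thread, and the variable names are illustrative. It also shows why the decoder complains: 0xF3 is a lead byte for a four-byte sequence in UTF-8, so a lone 0xF3 followed by ASCII cannot be decoded.

```python
word = "Asunci\u00f3n"  # "Asunción", written with an escape for portability

latin1_bytes = word.encode("iso-8859-1")  # "ó" becomes the single byte 0xF3
utf8_bytes = word.encode("utf-8")         # "ó" becomes the two bytes 0xC3 0xB3

# Decoding the Latin-1 bytes as UTF-8 fails at the 0xF3 byte.
try:
    latin1_bytes.decode("utf-8")
    decode_failed_at = None
except UnicodeDecodeError as exc:
    decode_failed_at = exc.start  # index of the first offending byte

print(latin1_bytes, utf8_bytes, decode_failed_at)
```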

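How did you determine from "position 1689-1692" which line in the file it blew up on? Those numbers would be offsets into the chunk that it was trying to decode. You would have had to determine which chunk that was - how?

Try this at the interactive prompt: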

buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but it will give you the byte offset into the file
# set offset to the start position reported in the error, then:
buf[offset-50:offset+60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
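The same idea can be wrapped in a function that reads the offset from the exception instead of from the printed message. This is a sketch under assumptions: `locate_bad_bytes` and the sample buffer are hypothetical and not part of the original answer. It relies on the `start` attribute of `UnicodeDecodeError`.

```python
def locate_bad_bytes(buf, encoding="utf-8", context=50):
    """Return (offset, line_number, surrounding_bytes) for the first
    undecodable byte in buf, or None if buf decodes cleanly."""
    try:
        buf.decode(encoding)
    except UnicodeDecodeError as exc:
        offset = exc.start
        # Count the newlines before the offset to get a 1-based line number.
        line_number = buf.count(b"\n", 0, offset) + 1
        snippet = buf[max(offset - context, 0):offset + context]
        return offset, line_number, snippet
    return None


# Hypothetical sample: the third line holds a Latin-1 encoded "ó" (0xF3).
sample = b"alpha\nbeta\nAsunci\xf3n\ngamma\n"
result = locate_bad_bytes(sample)
print(result)
```

Because the whole file is decoded in one call, the reported offset is relative to the start of the file, not to an internal read chunk.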


2009-06-19 10:55:12

Can you check to make sure it is valid UTF-8? One way to do that is given in this SO question:

iconv -f UTF-8 /usr/share/dict/words -o /dev/null

There are other ways to do the same thing.
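One such alternative, sketched in Python under assumptions: read the file in binary mode and try to decode each line separately, so the first invalid line is reported directly. The function `first_invalid_line` and the temporary demo file are hypothetical. On a real system the path would be /usr/share/dict/words.

```python
import os
import tempfile


def first_invalid_line(path, encoding="utf-8"):
    """Return (line_number, raw_line) for the first line that fails to
    decode with the given encoding, or None if every line is valid."""
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                return line_number, raw
    return None


# Hypothetical demo file with one Latin-1 encoded line among ASCII lines.
check_path = os.path.join(tempfile.mkdtemp(), "words")
with open(check_path, "wb") as f:
    f.write(b"alpha\nAsunci\xf3n\nbeta\n")

bad = first_invalid_line(check_path)
print(bad)
```

Splitting on newline bytes is safe here because the byte 0x0A never appears inside a multi-byte UTF-8 sequence.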


2009-06-19 10:42:29

It says, "iconv: illegal input sequence at position 9881" – 2009-06-19 12:11:13
