UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I know there are a lot of questions about encoding and decoding in Python, but I can't seem to figure this one out:

def content(title, sents):
    sent_elems = []
    for sent_i, sent in enumerate(sents, 1):
        elem = u"<a name=\"{i}\">[{i}]</a> <a href=\"#{i}\" id={i}>{text}</a>".format(i=sent_i, text=sent.text)
        sent_elems.append(elem)
    doc = u"""<html>
<head>
<title>{title}</title>
</head>
<body>{elems}</body>
</html>""".format(title=title, elems="\n".join(sent_elems))

    return doc

Calling the content function gives me this error in very rare cases (maybe once or twice in my entire dataset):

  File "processing.py", line 68, in score_summary
    self._write_config(references, summary)
  File "processing.py", line 56, in _write_config
    reference_files = self._write_references(references, reference_dir)
  File "processing.py", line 44, in _write_references
    f.write(rouge_summary_content(reference.id, reference.sents))
  File "processing.py", line 154, in rouge_summary_content
    </html>""".format(title=title, elems="\n".join(sent_elems))
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I have tried changing the append line to:

sent_elems.append(elem.decode("utf-8", "ignore"))

and also to:

sent_elems.append(elem.decode("utf-8", "replace"))

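Still the same error.

I looked at the data but can't figure out why this happens. I checked the file where the error occurs, and it contains no non-UTF-8 characters.

I also added this to my file: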

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")

The problem is still there. Any suggestions?

2014-10-01 user3430235

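Do **not** use 'sys.setdefaultencoding()'. It is like binding a broken leg and walking on, instead of going to the ER to have a cast set. Things are still broken, and you will feel the pain later on and have to have the bone reset. – 2014-10-01 21:01:42

It is quite likely that your 'title' is a byte string rather than unicode. – 2014-10-01 21:02:51

Removing it would cause more problems. By setting sys.setdefaultencoding("utf-8") I got past almost all of the encoding and decoding errors. Only a few persistent cases remain, and I need to either get rid of them or find out where they come from. – user3430235 2014-10-01 21:04:58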
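
The comment about 'title' being a byte string points at one possible remedy: convert every input to text explicitly before it reaches the format call, rather than relying on a global default encoding. Below is a minimal sketch, assuming Python 3 (the question's code is Python 2, where the text type is unicode rather than str). The names to_text and render_title are hypothetical helpers, not part of the question's code.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return value.decode(encoding, errors="replace")
    return str(value)


def render_title(title):
    # Hypothetical stand-in for the <title> part of the question's template.
    return "<title>{title}</title>".format(title=to_text(title))
```

With this, a title made of the single byte 0x80 yields a replacement character instead of raising. Whether silently replacing bad bytes is acceptable depends on the data.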

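My title was chr(65+index), so once it ran past all the capital letters it produced some non-UTF-8 characters. I changed it to str(index), and that solved my original problem.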

2014-10-01 21:27:04 user3430235
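
The byte in the error message, 0x80, is 128, which is 65 + 63, so the failure matches a title built from index 63. Here is a small sketch of that cause, assuming Python 3, where the decode has to be written out explicitly (in Python 2 it happens implicitly when a byte string is mixed into a unicode format string). The function name make_title_bytes is hypothetical and only mimics what chr(65+index) returns in Python 2.

```python
def make_title_bytes(index):
    # Mirrors chr(65 + index) in Python 2, which returns a one-byte string.
    return bytes([65 + index])


# Index 0 gives b"A", which decodes without trouble.
first_title = make_title_bytes(0).decode("utf-8")

# Index 63 gives the byte 0x80, which cannot start a UTF-8 sequence.
error = None
try:
    make_title_bytes(63).decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

# The fix from the answer: build the title from the number itself.
fixed_title = str(63)
```

After running this, error holds the same kind of exception as in the traceback, with error.start equal to 0.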

Unfortunately, this did not solve the problem after all. I am getting another error. – user3430235 2014-10-02 16:58:39

If your data looks like the following:

data="0\x80\x06\t*\x86H\x86\xf7\r\x01\x07\x04\xa0\x800\x80\x02\x01\x01\x0e0\x0c\x06\b*\x86H\x86\xf7\r\x02\x05\x05....."

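then with the approach below we can get it into a form that decodes as UTF-8:

import base64
import urllib
encoded = base64.b64encode(data)
decoded = urllib.unquote(encoded).decode('utf8')

The result will look like this: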

MIAGCSqGSIb3DQEHAq...

2016-10-11 09:52:19 vijay
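
For comparison, a rough Python 3 version of the same idea, using only the first nine bytes of the sample data above. Note what it does and does not do: Base64 turns arbitrary bytes into ASCII text that is safe to embed, but it does not interpret the original bytes as UTF-8. The urllib.unquote step is left out here because Base64 output contains no percent escapes. The variable names are illustrative.

```python
import base64

# First nine bytes of the sample data from the answer, as a bytes literal.
data = b"0\x80\x06\t*\x86H\x86\xf7"

# Base64 output is pure ASCII, so decoding it to text cannot fail.
encoded_text = base64.b64encode(data).decode("ascii")

# The original bytes come back unchanged from the text form.
round_trip = base64.b64decode(encoded_text)
```

This is useful when binary data has to travel through a text-only channel. It would not fix the question's error, since the template would then contain Base64 text rather than the intended title.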
