UnicodeDecodeError with medieval characters

I am trying to run the nltk tokenizer on medieval texts. These texts use medieval characters such as yogh (ȝ), thorn (þ), and eth (ð).

When I run the program (pasted below) with the standard Unicode (UTF-8) encoding, I get the following error:

Traceback (most recent call last):
  File "me_scraper_redux2.py", line 11, in <module>
    tokens = nltk.word_tokenize(open("ME_Corpus_sm/"+file, encoding="utf_8").read())
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

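I have tried other encodings, such as Latin1, and they make the error go away, but because they substitute other characters for the medieval ones, I don't get accurate results. I thought Unicode could handle these characters. Am I doing something wrong, or is there another encoding I should be using? The files were originally in UTF-8. See my code below: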

import nltk
import os, os.path
import string

from nltk import word_tokenize
from nltk.corpus import stopwords

files = os.listdir("ME_Corpus_sm/")
for file in files:
    # open, parse, and normalize the tokens (words) in the file
    tokens = nltk.word_tokenize(open("ME_Corpus_sm/"+file, encoding="utf_8").read())
    tokens = [token.lower() for token in tokens]
    tokens = [''.join(character for character in token if character not in string.punctuation) for token in tokens]
    tokens = [token for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stopwords.words('english')]

    # output the 50 most frequent tokens and their counts
    for word, count in nltk.FreqDist(tokens).most_common(50):
        print(word + "\t" + str(count))

Source

2015-05-12 Andrew WK

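Can you post a very small excerpt of the text that includes a thorn (possibly as a hex dump or base64 encoding of the excerpt)? The error ("invalid start byte 0x80") seems to point to invalid UTF-8, since 0x80 is a 10xxxxxx byte, which should be a *continuation* byte and never be found at the start of a token. It could be encountered in ISO-8859-15 (Latin1) text, but... – LSerni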

FYI, Unicode is *not* UTF-8.

Martin, thank you. I'm still wrapping my head around this stuff!

Your file is not valid UTF-8.
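The "invalid start byte" message in the traceback is consistent with this. As a minimal sketch (the helper names `is_continuation_byte` and `utf8_hex` are made up for illustration and are not part of nltk or the standard library), the following shows how the three characters from the question are encoded in UTF-8, and that a lone 0x80 byte has the 10xxxxxx continuation pattern and so cannot begin a character:

```python
def is_continuation_byte(byte):
    """Return True if the byte has the UTF-8 continuation pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(format(b, "02x") for b in text.encode("utf-8"))


# Each of these characters takes two bytes: a lead byte, then a continuation byte.
for ch in "ȝþð":
    print(ch, utf8_hex(ch))

# 0x80 is 10000000 in binary, so it can only follow a lead byte.
print(is_continuation_byte(0x80))

# Decoding a lone 0x80 fails with the same reason shown in the traceback.
try:
    b"\x80".decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)
```

If the corpus files contain such a byte on its own, the thorns, eths, and yoghs themselves may well be encoded correctly, with the problem confined to a few stray bytes elsewhere.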

Maybe it's partly UTF-8 and partly some other junk? You could try:

open(..., encoding='utf-8', errors='replace')

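to replace the non-UTF-8 sequences with the replacement character (U+FFFD, usually displayed as a question mark) instead of throwing an error, which may give you a chance to see where the problem is. In general, if you mix encodings within a single file you are pretty much doomed, because they can't be reliably separated.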
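To see where the problem bytes actually are, one option is to read the file in binary mode and collect the spans that fail to decode. The following is only a sketch: `find_invalid_utf8` is a hypothetical helper, and the sample bytes are invented to stand in for a corpus file (a real file would be loaded with `open(path, "rb").read()`).

```python
def find_invalid_utf8(data, context=10):
    """Return (offset, bad_bytes, surrounding_bytes) for each undecodable span."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice that was decoded
            start = pos + err.start
            end = pos + err.end
            snippet = data[max(0, start - context):end + context]
            problems.append((start, data[start:end], snippet))
            pos = end  # resume scanning just past the bad span
    return problems


# Invented sample: valid UTF-8 text with one stray 0x80 byte in the middle.
sample = "þe ȝere".encode("utf-8") + b"\x80" + " ðat".encode("utf-8")

for offset, bad, snippet in find_invalid_utf8(sample):
    print(offset, bad, snippet)

# With errors="replace", each undecodable span becomes U+FFFD instead of raising.
replaced = sample.decode("utf-8", errors="replace")
print(replaced.count("\ufffd"))
```

Printing the bytes around each offset makes it easier to guess what the stray bytes were meant to be, for example punctuation saved in a different encoding.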

Source

2015-05-12 17:35:42 bobince

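bobince, for whatever reason I hadn't tried that yet. I have now, and it works great. All the thorns, eths, and yoghs display correctly, so I don't know where the problem was. Skimming through the text (about 2,000 lines), I can't even see any noticeable number of question marks that aren't related to punctuation. Thanks very much!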
