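Tokenizing text with scikit-learn
I have the following code to extract features from a set of files for text classification (the folder name is the category name).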
import sklearn.datasets
from sklearn.feature_extraction.text import TfidfVectorizer
train = sklearn.datasets.load_files('./train', description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
print len(train.data)
print train.target_names
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
This raises the following stack trace:
Traceback (most recent call last):
  File "C:\EclipseWorkspace\TextClassifier\main.py", line 16, in <module>
    X_train = vectorizer.fit_transform(train.data)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 1285, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 32054: invalid start byte
I am running Python 2.7. How can I get this to work?
EDIT: I have just discovered that this works fine for files with utf-8 encoding (my files are ANSI encoded). Is there any way to get sklearn.datasets.load_files() to work with ANSI encoding?
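Can you add a sample of the data? Perhaps the data is not utf-8 encoded; maybe it is utf-16? It is hard to say without knowing more about the data format. I am no expert, but you could try converting the strings to utf-8 with something like 'each_string.decode("utf-16").encode("utf-8")'. – ohruunuruus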
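A minimal sketch of the re-encoding idea from the comment above. It assumes the bytes are really Windows-1252 (the codec Windows tools usually mean by "ANSI" in Western locales) rather than utf-16; the helper name reencode_to_utf8 and the sample bytes are made up for illustration, so swap in whatever codec your files actually use.

```python
def reencode_to_utf8(raw_bytes, source_encoding='cp1252'):
    """Decode raw bytes from source_encoding, then re-encode them as UTF-8."""
    return raw_bytes.decode(source_encoding).encode('utf-8')


# Hypothetical document content containing the byte 0xff from the traceback.
sample = b'abc \xff def'

# Decoding as UTF-8 fails, because 0xff can never start a UTF-8 sequence.
try:
    sample.decode('utf-8')
except UnicodeDecodeError as error:
    print('utf-8 decode failed: %s' % error)

# Under the assumed source encoding the same bytes decode cleanly,
# and the result can be written back out as UTF-8.
converted = reencode_to_utf8(sample)
print(repr(converted))
```

If the guess about the source encoding is wrong, the text will decode without an error but contain the wrong characters, so it is worth spot-checking a few documents by eye.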
@ohruunuruus My training data is similar to the 20 newsgroups dataset and is ANSI encoded – raul
'TfidfVectorizer' takes an encoding parameter. Try passing 'encoding = ansi' and report back any errors – mbatchkarov
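For what it's worth, here is a rough sketch of the two places the codec can be passed: the encoding parameter of TfidfVectorizer, or the encoding parameter of load_files. It assumes "ANSI" means Windows-1252, spelled 'cp1252' in Python (a bare 'ansi' is not a codec name I would rely on under Python 2.7). The throwaway corpus, folder names, and sample bytes are invented for the example.

```python
import os
import shutil
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: the "ANSI" files are Windows-1252 encoded.
SOURCE_ENCODING = 'cp1252'

# Build a tiny throwaway corpus (hypothetical categories) in the layout
# load_files expects: one sub-folder per category.
root = tempfile.mkdtemp()
samples = {
    'cafes': b'na\xefve caf\xe9 \xff',
    'plain': b'plain ascii text',
}
for category, raw in samples.items():
    os.mkdir(os.path.join(root, category))
    with open(os.path.join(root, category, 'doc.txt'), 'wb') as handle:
        handle.write(raw)

# Option 1: keep train.data as raw bytes and let the vectorizer decode it.
train = load_files(root, shuffle=True, random_state=0)
vectorizer = TfidfVectorizer(encoding=SOURCE_ENCODING)
X_train = vectorizer.fit_transform(train.data)

# Option 2: let load_files decode to unicode, then use a default vectorizer.
train_decoded = load_files(root, encoding=SOURCE_ENCODING,
                           shuffle=True, random_state=0)
X_train_decoded = TfidfVectorizer().fit_transform(train_decoded.data)

print(X_train.shape)
print(X_train_decoded.shape)

# Remove the temporary corpus again.
shutil.rmtree(root)
```

Either option should avoid the UnicodeDecodeError as long as the codec matches the files. If a few documents still contain stray bytes, both load_files and TfidfVectorizer also accept decode_error='ignore' or decode_error='replace', at the cost of silently dropping or substituting those characters.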