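I am trying to train a classifier on some text data using scikit-learn. The same code runs on another computer without any error, but on my system it fails with: UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 1266: invalid start byte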
File "/root/Desktop/karim/svn/questo-anso/v5/trials/classify/domain_detection_final/test_classifier_temp.py", line 130, in trainClassifier
X_train = self.vectorizer.fit_transform(self.data_train.data)
File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 1270, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 808, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 741, in _count_vocab
for feature in analyze(doc):
File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 233, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 111, in decode
doc = doc.decode(self.encoding, self.decode_error)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 1266: invalid start byte
I have already checked similar threads, but they did not help.
UPDATE:
self.data_train = self.fetch_data(cache, subset='train')
if not os.path.exists(self.root_dir+"/autocreated/vectorizer.txt"):
    self.vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                      stop_words='english')
    start_time = time()
    print("Transforming the dataset")
    X_train = self.vectorizer.fit_transform(self.data_train.data)  # Error is here
    joblib.dump(self.vectorizer, self.root_dir+"/autocreated/vectorizer.txt")
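The traceback shows the vectorizer calling doc.decode(self.encoding, self.decode_error) on each raw document, so the failure depends on the bytes in the training data rather than on the vectorizer settings shown here. One possible workaround, sketched below for Python 3 (the traceback is from Python 2.7), is to decode the documents before vectorizing: try UTF-8 first and fall back to Latin-1. The helper name to_text, the sample byte strings, and the choice of Latin-1 as fallback are all assumptions for illustration; the real files may use a different encoding.

```python
def to_text(doc, primary="utf-8", fallback="latin-1"):
    """Decode a byte string, falling back to a second encoding on failure.

    Hypothetical helper: the fallback encoding is a guess and should be
    replaced with the encoding the files were actually written in.
    """
    if not isinstance(doc, bytes):
        return doc  # already text, leave untouched
    try:
        return doc.decode(primary)
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return doc.decode(fallback)


# Made-up stand-in for self.data_train.data
raw_documents = [b"plain", u"caf\u00e9".encode("utf-8"), b"20\xba"]
decoded_documents = [to_text(doc) for doc in raw_documents]
```

The decoded list could then be passed to fit_transform in place of the raw byte strings. TfidfVectorizer also accepts encoding and decode_error arguments, which may be the simpler route if all files share one known encoding.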
0xba is indeed an invalid start byte. What is the question? – 2014-09-01 06:23:28
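To illustrate why 0xba cannot start a character: in UTF-8, bytes in the range 0x80 to 0xBF (bit pattern 10xxxxxx) may only continue a multi-byte sequence. A minimal sketch for Python 3, using a made-up byte string rather than the actual training data:

```python
# Made-up sample: "20" followed by byte 0xba, which is the character "º" in Latin-1.
raw = b"20\xba"

# 0xba is 0b10111010; the leading bits "10" mark a UTF-8 continuation byte.
is_continuation = (0xBA & 0b11000000) == 0b10000000

try:
    raw.decode("utf-8")
    error_reason = None
    error_position = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason    # why the decoder gave up
    error_position = exc.start   # byte offset of the offending byte

# The same bytes decode without error when read as Latin-1.
latin1_text = raw.decode("latin-1")
```

A lone byte like this usually suggests the file was saved in a single-byte encoding such as Latin-1 or Windows-1252, though that is only a guess without seeing the data.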
Encode the text, i.e. 'text.encode("utf-8")', and inspect it; you may get a clue – MaNKuR 2014-09-01 06:33:24
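One way to inspect the text is to find out which documents fail to decode and at which byte. The sketch below is written for Python 3 and uses a hypothetical helper, find_undecodable, with made-up sample documents standing in for self.data_train.data:

```python
def find_undecodable(documents, encoding="utf-8"):
    """Return (index, byte offset, offending byte) for each document that fails to decode."""
    failures = []
    for index, doc in enumerate(documents):
        if not isinstance(doc, bytes):
            continue  # already decoded text, nothing to check
        try:
            doc.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that could not be decoded.
            failures.append((index, exc.start, doc[exc.start:exc.start + 1]))
    return failures


# Made-up stand-in for the real training documents
sample_documents = [
    b"plain ascii",
    u"caf\u00e9".encode("utf-8"),
    b"temp 20\xba C",
]
failures = find_undecodable(sample_documents)
```

Printing a few bytes around each reported offset often makes it clear which encoding the file was really written in, and comparing the result on both machines would show whether the two systems are reading the same files.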
@nm: I am not sure either. The encoding looks fine, but I do not know why it raises an encoding error – user123 2014-09-01 06:36:20