Using multiple features with scikit-learn

I'm using scikit-learn for text classification. Everything works fine with a single feature, but introducing multiple features gives me errors. I think the problem is that I'm not formatting the data the way the classifier expects.
For example, this works fine:
data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
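The contents of Pipeline(...) aren't shown, but for context a text-classification pipeline along these lines is typical (a minimal sketch; the CountVectorizer/TfidfTransformer/MultinomialNB steps are assumptions, not the asker's actual pipeline):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Assumed pipeline steps: the vectorizer expects an iterable of raw strings,
# one string per document, which is why a 1-D array of text works here.
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])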
But this:
data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
gives the following error after classifier.fit:
Traceback (most recent call last):
File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
classifier.fit(X_train, Y_train)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
It dies when preprocess() is called. I think the problem lies in how I'm formatting the data, but I can't figure out how to do it correctly.
Both feature1 and feature2 are English text strings, as is the target. I use LabelEncoder() to encode the target, and that seems to work fine.
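The traceback points at the vectorizer's preprocessing step: it calls .lower() on each "document", so every element of the array passed to fit must be a single string. With df[['feature1', 'feature2']] each row is a numpy array of two strings, hence the AttributeError. One simple workaround, if it's acceptable to treat the two columns as one document, is to join them into a single string column before splitting (a sketch reusing df, label_encoder, train_test_split and classifier from the code above):

import numpy as np

# Join the two text columns into one string per row (1-D array of strings),
# so the vectorizer sees exactly one document per sample.
data = (df['feature1'] + ' ' + df['feature2']).values
classes = label_encoder.transform(np.asarray(df['target']))

X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier.fit(X_train, Y_train)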
Here is an example of what print data returns, so you can see how it is currently formatted:
[['some short english text'
'a paragraph of english text']
['some more short english text'
'a second paragraph of english text']
['some more short english text'
'a third paragraph of english text']]
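If the two columns should be vectorized separately rather than concatenated, the usual pattern is a FeatureUnion in which each branch selects one column and vectorizes it. This is only a sketch, assuming X_train is the 2-D array of strings shown above; ItemSelector is a small hypothetical helper, not part of scikit-learn:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column of a 2-D array of strings (hypothetical helper)."""
    def __init__(self, index):
        self.index = index
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Returns a 1-D array of strings, which CountVectorizer accepts.
        return X[:, self.index]

classifier = Pipeline([
    ('features', FeatureUnion([
        ('feature1', Pipeline([
            ('select', ItemSelector(0)),
            ('vect', CountVectorizer()),
        ])),
        ('feature2', Pipeline([
            ('select', ItemSelector(1)),
            ('vect', CountVectorizer()),
        ])),
    ])),
    ('clf', MultinomialNB()),
])
classifier.fit(X_train, Y_train)  # X_train: the 2-D array from the question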
So how are you formatting the data? I've generally found that I can just pass a pandas DataFrame directly to scikit functions and it works fine. – BrenBarn
I tried passing the DataFrame directly to `train_test_split()` and got the same error. `train_test_split(df['feature1'], label_encoder.transform(df['target']))` is fine. `train_test_split(df[['feature1', 'feature2']], label_encoder.transform(df['matches']))` is not. –
Can you print out what each `X_train` looks like in the two cases? – ely