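Scikit-learn has several tools designed specifically for extracting features from text input; see the Text Feature Extraction section of the documentation.
Here is an example of a classifier built from a list of strings: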
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
data = [['this is about dogs', 'dogs are really great'],
['this is about cats', 'cats are evil']]
labels = ['dogs',
'cats']
vec = CountVectorizer() # count word occurrences
X = vec.fit_transform([' '.join(row) for row in data])
clf = MultinomialNB() # very simple model for word counts
clf.fit(X, labels)
new_data = ['this is about cats too', 'I think cats are awesome']
new_X = vec.transform([' '.join(new_data)])
print(clf.predict(new_X))
# ['cats']
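If you would rather not call the vectorizer and the classifier separately, the same steps can be bundled with `make_pipeline` from `sklearn.pipeline`. This is only a minimal sketch on the same toy data; the names `docs` and `model` are my own, and the strings are the pre-joined rows from the example above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same toy data as above, with each row already joined into one string
docs = ['this is about dogs dogs are really great',
        'this is about cats cats are evil']
labels = ['dogs', 'cats']

# The pipeline applies CountVectorizer first, then MultinomialNB
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

# Raw strings go straight in; the pipeline handles the transform step
pred = model.predict(['this is about cats too I think cats are awesome'])
print(pred)
```

The benefit is that new text never has to be transformed by hand, so it is harder to accidentally call `fit_transform` on the new data instead of `transform`.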
If more than 50% is missing, and it's text data, how can it be useful? You may need to add more information about your original data. – Leb
If you haven't already, check whether the `csv` module can help. Examples on [PMOTW](https://pymotw.com/2/csv/) – Pynchia