CountVectorizer: inconsistent numbers of samples
I am trying to run multinomial Naive Bayes classification on a set of tweets.
Here is my code:
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
trainfile = 'train.txt'
testfile = 'test.txt'
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8')) ## Error here
tags = ['Pro_vax','Anti_vax','Neither']
mnb = MultinomialNB()
mnb.fit(trainset, tags)
codecs.open(testfile,'r','utf8')
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = mnb.predict(testset)
print(results)
The file train.txt contains the following text:
Vaccines are a very good idea. They prevent all sorts of deadly diseases.
Vaccines cause autism. Do not vaccinate your children
Going to read about vaccines. Then, I am going to see my brother with autism.
I label these lines using the tags variable, one label per line.
The file test.txt contains the following text:
Do not get your kids vaccinated. Vaccination and autism are correlated.
When I run the script, I get the following error:
ValueError: Found arrays with inconsistent numbers of samples: [3 9]
I am not familiar with this error. What does it mean, and how can I prevent it from happening again?
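
One thing that may be worth checking: CountVectorizer produces one row per item yielded by the iterable it is given, and iterating over an open file yields one item per line, blank lines included. If the file on disk has more lines than there are entries in tags, the two counts will disagree. The sketch below is only a diagnostic under that assumption, not a confirmed explanation of the error. The helper names load_documents and check_one_label_per_document, and the in-memory sample standing in for train.txt, are hypothetical.

```python
import io


def load_documents(lines):
    """Return the non-blank lines of an iterable, stripped of whitespace."""
    return [line.strip() for line in lines if line.strip()]


def check_one_label_per_document(documents, labels):
    """Raise ValueError early if the document and label counts differ."""
    if len(documents) != len(labels):
        raise ValueError(
            "%d documents but %d labels" % (len(documents), len(labels))
        )


# Hypothetical stand-in for train.txt: three tweets separated by blank lines.
sample_file = io.StringIO(
    u"Vaccines are a very good idea.\n"
    u"\n"
    u"Vaccines cause autism.\n"
    u"\n"
    u"Going to read about vaccines.\n"
)
tags = ['Pro_vax', 'Anti_vax', 'Neither']

# Number of items a line-by-line iteration over the raw file would yield.
raw_line_count = len(sample_file.getvalue().splitlines())

# Keep only the lines that actually contain text, then compare with the labels.
documents = load_documents(sample_file)
check_one_label_per_document(documents, tags)

print(raw_line_count, len(documents), len(tags))
```

If the counts match after filtering, the resulting list of strings can be passed to word_vectorizer.fit_transform in place of the open file handle, so that the number of rows equals len(tags).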