2015-04-22 44 views

Inconsistent numbers of samples in CountVectorizer

I am trying to use multinomial Naive Bayes classification on a set of tweets.

Here is my code:

import codecs 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 
trainfile = 'train.txt' 
testfile = 'test.txt' 
word_vectorizer = CountVectorizer(analyzer='word') 
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8')) ## Error here 
tags = ['Pro_vax','Anti_vax','Neither'] 
mnb = MultinomialNB() 
mnb.fit(trainset, tags) 
codecs.open(testfile,'r','utf8') 
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8')) 
results = mnb.predict(testset) 
print results 

The file train.txt contains the following text:

Vaccines are a very good idea. They prevent all sorts of deadly diseases. 
Vaccines cause autism. Do not vaccinate your children 
Going to read about vaccines. Then, I am going to see my brother with autism. 

I label them using the tags variable.

The file test.txt contains the following text:

Do not get your kids vaccinated. Vaccination and autism are correlated. 

When I run the script, I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [3 9] 

I am not familiar with this error. What does it mean, and how can I prevent it from happening again?

Answers


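It would be easier to tell if you posted the full traceback, but it looks like the labels contain 9 entries while the training set contains only three data points. What does tags look like?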
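One way to narrow this down without the traceback is to count both sides before calling fit. The snippet below is only a rough sketch, not the asker's actual setup: the helper names count_documents and check_alignment are made up for illustration, and io.StringIO stands in for the opened train.txt. It relies on the fact that iterating over a file yields one item per line (blank lines included), and each item passed to the vectorizer is treated as one document.

```python
import io


def count_documents(lines):
    """Count the items an iterable of lines yields (one document per line)."""
    return sum(1 for _ in lines)


def check_alignment(lines, labels):
    """Return (n_documents, n_labels); raise early if the two counts differ."""
    n_docs = count_documents(lines)
    n_labels = len(labels)
    if n_docs != n_labels:
        raise ValueError("%d documents but %d labels" % (n_docs, n_labels))
    return n_docs, n_labels


# Hypothetical stand-in for codecs.open('train.txt', 'r', 'utf8').
# A blank line or a sentence wrapped onto its own line would add a document.
train_text = io.StringIO(
    u"Vaccines are a very good idea.\n"
    u"Vaccines cause autism.\n"
    u"Going to read about vaccines.\n"
)
tags = ['Pro_vax', 'Anti_vax', 'Neither']

print(check_alignment(train_text, tags))
```

In the script from the question, the equivalent check would be to compare trainset.shape[0] with len(tags) just before mnb.fit(trainset, tags); whichever of the two is 9 shows which side needs fixing.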