Inconsistent numbers of samples in CountVectorizer

I am trying to use multinomial naive Bayes classification on a set of tweets.

Here is my code:

import codecs 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 
trainfile = 'train.txt' 
testfile = 'test.txt' 
word_vectorizer = CountVectorizer(analyzer='word') 
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8')) ## Error here 
tags = ['Pro_vax','Anti_vax','Neither'] 
mnb = MultinomialNB() 
mnb.fit(trainset, tags) 
codecs.open(testfile,'r','utf8') 
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8')) 
results = mnb.predict(testset) 
print results

The file train.txt contains the following text:

Vaccines are a very good idea. They prevent all sorts of deadly diseases. 
Vaccines cause autism. Do not vaccinate your children 
Going to read about vaccines. Then, I am going to see my brother with autism.

I label these with the tags variable.

The file test.txt contains the following text:

Do not get your kids vaccinated. Vaccination and autism are correlated.

When I run the script, I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [3 9]

I am not familiar with this error. What does it mean, and how can I prevent it from happening again?

2015-04-22 user3600497

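It would be easier to tell with the full traceback, but the message means that the feature matrix and the label list passed to fit have different numbers of samples: one has 3 entries and the other has 9. The tags list in your code has three entries, so the vectorizer has presumably produced nine rows. CountVectorizer treats every item yielded by the iterable as one document, and iterating over an open file yields one item per line, so it is worth checking how many lines train.txt actually contains.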

2015-04-23 02:51:12
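
One way to catch this kind of mismatch before it reaches mnb.fit is to count the documents and the labels up front. The sketch below is only an illustration under stated assumptions: the helper names load_documents and check_same_length are made up for this example, the in-memory sample stands in for train.txt, and the blank lines in it are a guess at why a file with three tweets could yield more than three lines (the real file may differ). It uses Python 3 syntax, unlike the Python 2 print statement in the question.

```python
import io


def load_documents(lines):
    """Return one stripped string per non-empty line (hypothetical helper)."""
    return [line.strip() for line in lines if line.strip()]


def check_same_length(documents, labels):
    """Raise ValueError early if documents and labels cannot be paired up."""
    if len(documents) != len(labels):
        raise ValueError(
            "%d documents but %d labels" % (len(documents), len(labels))
        )


# Assumed stand-in for train.txt: three tweets separated by blank lines.
sample_file = io.StringIO(
    "Vaccines are a very good idea. They prevent all sorts of deadly diseases.\n"
    "\n"
    "Vaccines cause autism. Do not vaccinate your children\n"
    "\n"
    "Going to read about vaccines. Then, I am going to see my brother with autism.\n"
)
tags = ['Pro_vax', 'Anti_vax', 'Neither']

raw_lines = list(sample_file)          # one entry per physical line, blanks included
documents = load_documents(raw_lines)  # blank lines dropped
check_same_length(documents, tags)     # passes; raw_lines would raise ValueError

print(len(raw_lines), len(documents), len(tags))  # prints: 5 3 3
```

If the counts agree, the cleaned list can be passed to word_vectorizer.fit_transform in place of the open file object, so that the number of rows in the matrix matches the number of entries in tags.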
