2015-11-23

Input length mismatch in scikit-learn

I'm trying to use a DecisionTreeClassifier for some analysis, but it gives me the following error:

ValueError: Number of features of the model must match the input. Model n_features is 1 and input n_features is 4

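I used the same training and test sets with an SVC and a GaussianNB classifier, and both worked fine. My code is below. I know the test and training sets have the same structure: before vectorization, both are lists of strings. I can't tell where the mismatch is coming from.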

#classify with just scikit 

from sklearn.feature_extraction.text import TfidfVectorizer 
from tools.striper import stripe, cleanupfiles 
from tools.tweetprocessor import clean, wordclean 

from sklearn import svm 
from sklearn.naive_bayes import GaussianNB, MultinomialNB 
from sklearn.metrics import classification_report 
from sklearn import tree 

stripe(0.1) 

training = [] 
traininglabel = [] 
test = [] 
testlabel = [] 

with open('tempdata/goodtraining.txt', 'r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        tweet = [x for x in tweet if len(x) >= 3]
        training.append(' '.join(tweet))
        traininglabel.append('good')
with open('tempdata/badtraining.txt', 'r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        tweet = [x for x in tweet if len(x) >= 3]
        training.append(' '.join(tweet))
        traininglabel.append('bad')
with open('tempdata/goodtest.txt', 'r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        test.append(' '.join(tweet))
        testlabel.append('good')
with open('tempdata/badtest.txt', 'r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        test.append(' '.join(tweet))
        testlabel.append('bad')

vectorizer = TfidfVectorizer(min_df=0.1,max_df=0.9) 
train_vect = vectorizer.fit_transform(training) 
test_vect = vectorizer.fit_transform(test) 

print (train_vect) 
print (test_vect) 

classifier = tree.DecisionTreeClassifier() 
classifier.fit(train_vect.toarray(), traininglabel) 
predictions = classifier.predict(test_vect.toarray()) 

print (classification_report(testlabel, predictions)) 

cleanupfiles() 

Answer

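You need to change

test_vect = vectorizer.fit_transform(test)

to

test_vect = vectorizer.transform(test)

The vectorizer's fit() method should only be called on the training data. Calling fit_transform() on the test set makes the vectorizer learn a new vocabulary from the test documents, so the test matrix ends up with a different number of columns (4 here) than the matrix the model was trained on (1 here).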
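To make the difference concrete, here is a minimal sketch on toy data. The example sentences, the labels, and the `refit_vect` name are assumptions made up for illustration; they are not the asker's tweets, and the default `TfidfVectorizer` settings are used instead of `min_df`/`max_df` so the tiny corpus keeps all of its terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for the cleaned tweet lists in the question.
training = ["good happy day", "happy great fun", "bad sad day", "sad awful rain"]
traininglabel = ["good", "good", "bad", "bad"]
test = ["happy fun", "awful sad storm"]

# Learn the vocabulary from the training data only.
vectorizer = TfidfVectorizer()
train_vect = vectorizer.fit_transform(training)

# Reuse that vocabulary for the test data: same columns as train_vect.
# Words never seen in training ("storm") are simply ignored.
test_vect = vectorizer.transform(test)

# For comparison: fitting a vectorizer on the test data learns a different
# vocabulary, so the column count no longer matches the training matrix.
refit_vect = TfidfVectorizer().fit_transform(test)

print(train_vect.shape)  # (4, 9)
print(test_vect.shape)   # (2, 9)
print(refit_vect.shape)  # (2, 5)

# The tree is trained and queried on matrices with the same 9 features.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(train_vect.toarray(), traininglabel)
predictions = classifier.predict(test_vect.toarray())
print(predictions)
```

Passing `refit_vect` to `classifier.predict` instead of `test_vect` would raise a feature-count error, because the tree was trained on 9 columns and would receive 5. The exact wording of that message varies between scikit-learn versions.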

That did it. Thanks.