2015-11-08 64 views

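scikit-learn: How to use two different datasets as train and test sets

I am trying to use two separate datasets as the training set and the test set. But with the code below I get: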

  File "main.py", line 84, in main_test
    X2 = tf_transformer.transform(word_counts2)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 1020, in transform
    n_features, expected_n_features))
ValueError: Input has n_features=1293 while the model has been trained with n_features=1625

def main_test(path = None): 
    dir_path = path or 'dataset' 
    files = sklearn.datasets.load_files(dir_path) 
    util.refine_all_emails(files.data) 
    word_counts = util.bagOfWords(files.data) 
    tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True) 
    tf_transformer.fit(word_counts) 
    X = tf_transformer.transform(word_counts) 

    dir_path = 'testset' 
    files2 = sklearn.datasets.load_files(dir_path) 
    util.refine_all_emails(files2.data) 
    word_counts2 = util.bagOfWords(files2.data) 
    # tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True) 
    # tf_transformer.fit(word_counts2) 
    X2 = tf_transformer.transform(word_counts2) 

    clf = sklearn.svm.LinearSVC() 

    test_classifier(X, files.target, clf, X2, files2.target, test_size=0.2, y_names=files.target_names, confusion=False) 


def test_classifier(X, y, clf, X2, y2, test_size=0.4, y_names=None, confusion=False): 
    X_train, X_test, y_train, y_test = X, X2, y, y2 
    clf.fit(X_train, y_train) 
    # clf.fit(X_test, y_test) 
    y_predicted = clf.predict(X_test) 

    print colored('Classification report:', 'magenta', attrs=['bold']) 
    print sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names) 

Answer


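This happens because when you call

word_counts2 = util.bagOfWords(files2.data) 

the bag of words is built from the test set alone, so it has its own vocabulary: it contains words that the TF-IDF transformer never saw in the training set, and the resulting matrix has a different number of columns (1293 instead of 1625). The transformer has no inverse document frequency for those unseen words.

For the test set you only need to count the words that appear in the training set. CountVectorizer may help with this.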
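A minimal sketch of that idea, assuming scikit-learn's CountVectorizer replaces the asker's util.bagOfWords helper (whose implementation is not shown in the question). The toy documents and labels below are hypothetical stand-ins for files.data, files.target and files2.data. The vocabulary is learned from the training set only and then reused for the test set, so both matrices have the same number of columns:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for files.data / files.target / files2.data.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "monday project meeting notes",
]
train_labels = [1, 0, 1, 0]
test_docs = ["buy pills today", "agenda for the project meeting"]

# Learn the vocabulary from the training set only.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)

# Reuse the same vocabulary for the test set; words that were not
# seen during training ("today", "the") are simply ignored.
test_counts = vectorizer.transform(test_docs)

# Fit the TF-IDF weights on the training counts, then apply them to both sets.
tf_transformer = TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(train_counts)
X_test = tf_transformer.transform(test_counts)

# Train on one dataset, predict on the other.
clf = LinearSVC()
clf.fit(X_train, train_labels)
y_predicted = clf.predict(X_test)

print(X_train.shape[1] == X_test.shape[1])
```

The key point is that fit (or fit_transform) is called only on the training data, while the test data only ever goes through transform.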


Comment: How can I do the same thing in R?
