2016-11-21 74 views
2

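I am trying to train an SVM model on some training and test data. The program works fine if I combine the test and training data, but if I keep them separate and check the model's accuracy, it says that the test and training datasets have a different number of features.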

Traceback (most recent call last):
  File "/home/PycharmProjects/analysis.py", line 160, in <module>
    main()
  File "/home/PycharmProjects/analysis.py", line 156, in main
    learn_model(tf_idf_train,target,tf_idf_test)
  File "/home/PycharmProjects/analysis.py", line 113, in learn_model
    predicted = classifier.predict(data_test)
  File "/home/.local/lib/python3.4/site-packages/sklearn/svm/base.py", line 573, in predict
    y = super(BaseSVC, self).predict(X)
  File "/home/.local/lib/python3.4/site-packages/sklearn/svm/base.py", line 310, in predict
    X = self._validate_for_predict(X)
  File "/home/.local/lib/python3.4/site-packages/sklearn/svm/base.py", line 479, in _validate_for_predict
    (n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 19137 should be equal to 4888, the number of features at training time

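The test set here is larger than the training set, so it naturally ends up with more features than the training set, which is why I get the ValueError.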

Here is my code:

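import csv

import numpy as np
from sklearn import cross_validation  # renamed sklearn.model_selection in scikit-learn 0.18
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC


def load_train_file():
    with open('~1k comments.csv', encoding='ISO-8859-1') as csv_file:
        reader = csv.reader(csv_file, delimiter=",", quotechar='"')
        next(reader)  # skip the first row
        data = []
        target = []
        for row in reader:
            if row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])

    return data, target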


def load_file():
    with open('comments.csv', encoding='ISO-8859-1') as csv_file:
        reader = csv.reader(csv_file, delimiter=",", quotechar='"')
        next(reader)  # skip the first row
        data = []
        target = []
        for row in reader:
            if row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])
        print(len(data))

    return data

# preprocess creates the term frequency matrix for the review data set
def preprocess():
    dataTrain, targetTrain = load_train_file()
    testData = load_file()
    count_vectorizer = CountVectorizer(binary='true')
    dataTrain = count_vectorizer.fit_transform(dataTrain)
    tfidf_train_data = TfidfTransformer(use_idf=True).fit_transform(dataTrain)

    count_vectorizer = CountVectorizer()
    testData = count_vectorizer.fit_transform(testData)
    tfidf_test_data = TfidfTransformer(use_idf=True).fit_transform(testData)

    return tfidf_train_data, tfidf_test_data

def learn_model(data, target, testData):
    data_train, data_test, target_train, target_test = cross_validation.train_test_split(
        data, target, test_size=0.001, random_state=43)
    e = np.zeros(testData.shape[0])
    data_train1, data_test, target_train1, target_test = cross_validation.train_test_split(
        testData, e, test_size=.9, random_state=43)
    classifier = SVC(gamma=.01, C=100.)
    classifier.fit(data_train, target_train)
    predicted = classifier.predict(data_test)  # the ValueError is raised here
    for x in range(0, 50):
        print(testData[x] + str(predicted[x]))

def evaluate_model(target_true, target_predicted):
    print(classification_report(target_true, target_predicted))
    print("The accuracy score is {:.2%}".format(accuracy_score(target_true, target_predicted)))

def main():
    data, target = load_train_file()
    datatest = load_file()

    tf_idf_train, tf_idf_test = preprocess()
    # print(tf_idf_train.shape())
    # print(tf_idf_test.shape())

    learn_model(tf_idf_train, target, tf_idf_test)
    # learn_model(data,target,datatest)


main()

How can I solve this problem?

Answer

5

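The same vectorizer and transformer must be used for both the train and test parts; also, the vectorizer should not be fit on the test data. So instead of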

count_vectorizer = CountVectorizer(binary='true') 
dataTrain = count_vectorizer.fit_transform(dataTrain) 
tfidf_train_data = TfidfTransformer(use_idf=True).fit_transform(dataTrain) 

count_vectorizer = CountVectorizer() 
testData = count_vectorizer.fit_transform(testData) 
tfidf_test_data = TfidfTransformer(use_idf=True).fit_transform(testData) 

use something like this:

count_vectorizer = CountVectorizer(binary=True) 
tfidf_transformer = TfidfTransformer(use_idf=True) 
dataTrain = count_vectorizer.fit_transform(dataTrain) 
tfidf_train_data = tfidf_transformer.fit_transform(dataTrain)

testData = count_vectorizer.transform(testData) 
tfidf_test_data = tfidf_transformer.transform(testData) 
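
To see where the mismatch comes from, here is a minimal sketch on a toy corpus. The documents and the names train_docs, test_docs and wrong_test are invented for illustration and are not from the question. A second vectorizer fitted on the test texts learns its own vocabulary, so its column count differs; reusing the objects fitted on the training texts keeps the column count equal.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy data, invented for illustration only.
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["good plot with a surprising twist", "bad acting"]

# Problematic: a second vectorizer fitted on the test texts builds a separate vocabulary.
wrong_test = CountVectorizer().fit_transform(test_docs)

# Fix: fit on the training texts only, then reuse the fitted objects on the test texts.
count_vectorizer = CountVectorizer(binary=True)
tfidf_transformer = TfidfTransformer(use_idf=True)
train_counts = count_vectorizer.fit_transform(train_docs)
tfidf_train = tfidf_transformer.fit_transform(train_counts)
tfidf_test = tfidf_transformer.transform(count_vectorizer.transform(test_docs))

# Compare the number of columns (features) in each matrix.
print(tfidf_train.shape, wrong_test.shape, tfidf_test.shape)
```

Words that appear only in the test texts (such as "twist" here) are simply ignored by transform, which is what keeps the test matrix aligned with the features the classifier was trained on.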

You can also use a Pipeline to make it neater:

from sklearn.pipeline import make_pipeline 
pipe = make_pipeline(
    CountVectorizer(binary=True), 
    TfidfTransformer(use_idf=True), 
) 
tfidf_train_data = pipe.fit_transform(dataTrain) 
tfidf_test_data = pipe.transform(testData) 
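
One possible extension, not part of the original answer, is to put the classifier at the end of the same pipeline, so that fit and predict always go through the same fitted vectorizer. The sketch below assumes a toy corpus with made-up labels; the SVC parameters are the ones from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data and labels, invented for illustration only.
train_docs = ["good movie", "great plot", "bad movie", "awful plot"]
train_labels = ["pos", "pos", "neg", "neg"]
test_docs = ["good plot twist", "awful acting"]

# Vectorizer, tf-idf weighting and classifier chained into one estimator.
model = make_pipeline(
    CountVectorizer(binary=True),
    TfidfTransformer(use_idf=True),
    SVC(gamma=.01, C=100.),
)

# fit() learns the vocabulary and idf weights from the training texts only;
# predict() reuses them, so the feature count cannot drift.
model.fit(train_docs, train_labels)
predicted = model.predict(test_docs)
print(predicted)
```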

Or even use TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in a single vectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer 
vec = TfidfVectorizer(binary=True, use_idf=True) 
tfidf_train_data = vec.fit_transform(dataTrain) 
tfidf_test_data = vec.transform(testData) 
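
As a quick sanity check, the following sketch compares the two approaches on a toy corpus (invented for illustration; the names pipe_test, vec_test and same are hypothetical). With the same parameters, the two-step pipeline and TfidfVectorizer are expected to produce the same test matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
train_docs = ["good movie", "bad movie", "great plot", "bad plot"]
test_docs = ["good plot twist", "bad acting in a great movie"]

# Two-step variant and single-object variant with matching parameters.
pipe = make_pipeline(CountVectorizer(binary=True), TfidfTransformer(use_idf=True))
vec = TfidfVectorizer(binary=True, use_idf=True)

# Both are fitted on the training texts only, then applied to the test texts.
pipe_test = pipe.fit(train_docs).transform(test_docs)
vec_test = vec.fit(train_docs).transform(test_docs)

# Compare the dense forms of the two sparse matrices.
same = np.allclose(pipe_test.toarray(), vec_test.toarray())
print(same)
```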