Please read the documentation of cross-validation at scikit-learn to understand it further.
Also, you are using cross_val_predict incorrectly. Internally it will call the cv you supplied (cv=10) to split the supplied data (i.e. X_train, t_train in your case) again into train and test parts, fit the estimator on the train part, and predict on the data that remains in the test part.
Now, for the evaluation on your X_test, y_test: you should first fit your estimator on the whole training data (cross_val_predict will not fit it for you), then use it to predict on the test data, and then calculate the accuracy.
A simple code snippet to describe the above (borrowing from your code; please read the comments and ask if anything is unclear):
# item feature matrix in X
X = data[features[:-1]].to_numpy()  # as_matrix() was removed in newer pandas versions
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# Until here everything is good
# You keep away 20% of data for testing (test_size=0.2)
# This test data should be unseen by any of the below methods
# define method
logreg=LogisticRegression()
# Ideally what you are doing here should be correct, unless you did something wrong in the dataframe operations (which apparently has been solved)
# cross-validation prediction
# This cross-validation prediction will print the predicted values of 't_train'
predicted = model_selection.cross_val_predict(logreg, X_train, t_train, cv=10)
# internal working of cross_val_predict:
#1. Get the data and estimator (logreg, X_train, t_train)
#2. From here on, we will use X_train as X_cv and t_train as t_cv (because cross_val_predict doesn't know that it's our training data) - Doubts??
#3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
#4. Use X_cv_train, t_cv_train for fitting 'logreg'
#5. Predict on X_cv_test (No use of t_cv_test)
#6. Repeat steps 3 to 5 repeatedly for cv=10 iterations, each time using different data for training and different data for testing.
# So here you are correctly comparing 'predicted' and 't_train'
print(metrics.accuracy_score(t_train, predicted))
# The above metrics will show you how our estimator 'logreg' works on 'X_train' data. If the accuracies are very high it may be because of overfitting.
# Now what to do about the X_test and t_test above.
# Actually, the correct reference for the final metric is the accuracy on X_test and t_test
# If you are satisfied by the accuracies on the training data then you should fit the entire training data to the estimator and then predict on X_test
logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)
# Here is the final accuracy
print(metrics.accuracy_score(t_test, t_pred))
# If this accuracy is good, then your model is good.
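The internal working of cross_val_predict described in the comments above (steps 3 to 6) can be sketched by hand with KFold. This is only an illustration of what happens inside, using a toy dataset as a stand-in for X_train, t_train; you would normally just call cross_val_predict:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn import metrics

# toy stand-ins for X_train, t_train
X_cv, t_cv = make_classification(n_samples=100, n_features=5, random_state=0)

predicted = np.empty_like(t_cv)
for train_idx, test_idx in KFold(n_splits=10).split(X_cv):
    fold_model = LogisticRegression()
    fold_model.fit(X_cv[train_idx], t_cv[train_idx])          # step 4: fit on the fold's train part
    predicted[test_idx] = fold_model.predict(X_cv[test_idx])  # step 5: predict on the fold's test part

# every sample was predicted exactly once, by a model that never saw it during fitting
print(metrics.accuracy_score(t_cv, predicted))
```

Note that the ten fold models are thrown away afterwards; that is why you still need the explicit logreg.fit(X_train, t_train) above before predicting on X_test.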
If you have less data, or don't want to split the data into training and testing sets, then you should use the approach suggested by @fuzzyhedge:
# Use cross_val_score on your all data
scores = model_selection.cross_val_score(logreg, X, t, cv=10)
# 'cross_val_score' will work almost the same as steps 1 to 4 above
#5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
#6. Repeat steps 1 to 5 for cv_iterations = 10
#7. Return array of accuracies calculated in step 5.
# Take the average of the returned accuracies to see the model performance
print(scores.mean())
A note of advice - cross-validation is best used together with grid search to find the estimator parameters that perform best on the given data. For example, LogisticRegression defines many parameters. But if you use
logreg = LogisticRegression()
the model is initialized with only the default parameters. Maybe different parameter values, such as
logreg = LogisticRegression(penalty='l1', solver='liblinear')
may perform better for your data. This searching for better parameters is grid search.
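A minimal grid-search sketch with GridSearchCV. The parameter values in the grid are illustrative assumptions, not recommendations for your data, and a toy dataset stands in for yours:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# toy stand-in data
X, t = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# candidate parameter values - an illustrative grid, tune it to your problem
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
}
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, t_train)          # cross-validates every combination on the training data

print(grid.best_params_)            # combination with the best mean CV accuracy
print(grid.score(X_test, t_test))   # final check on the held-out test data
```

GridSearchCV refits the estimator with the best parameters on the full training data, so grid itself can then be used for prediction.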
Now, as for the second part of your question - scaling, dimensionality reduction, etc. - use a pipeline. You can refer to the documentation of Pipeline and the examples there.
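A minimal Pipeline sketch for the scaling/dimensionality-reduction part. The step names and the choice of StandardScaler and PCA are assumptions for illustration; swap in whatever preprocessing you need:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy stand-in data
X, t = make_classification(n_samples=200, n_features=10, random_state=0)

# each CV fold re-fits the scaler and PCA on its own training part only,
# so no information leaks from that fold's test part
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce', PCA(n_components=5)),
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipe, X, t, cv=10)
print(scores.mean())
```

The point of putting the preprocessing inside the pipeline is exactly this leak-free behaviour: scaling the whole dataset before cross-validating would let test-fold statistics influence training.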
Feel free to contact me if you need any help.
Thanks so much, man! I fixed the code and now it works. The target being among the features wasn't really a problem, since the -1 in my code removed it, it being the last column. So the real problem was in fact that the target wasn't an np.array, as you pointed out (although, I admit, I really don't understand its mysterious connection to the size error the machine returned). Do you have some insight on how to finish the process, i.e. how to do the final test? I'm a bit confused about what I should do now. – Harnak
I have edited my answer to include the complete process using 'model_selection.cross_val_score'. As for the dimension errors, working between pd.DataFrames and np.ndarrays can be painful. You can print 'x.shape' of each ambiguous object for troubleshooting. The best way to learn these things is to dig into the sklearn documentation and tutorials. – 2017-02-17 20:45:04
I'm not sure I understood correctly. So, using cross_val_score makes the earlier split unnecessary? I mean: shouldn't cross-validation be done only on the training set, rather than on the whole set? Or maybe I'm missing the point of cross-validation. – Harnak