Cross-validation when classifying text documents with scikit-learn

When classifying text documents with scikit-learn, do you do cross-validation first and then feature extraction, or the other way around?

Here is my pipeline:

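from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# FeatureExtractor and Spellingchecker are custom transformers defined elsewhere (not shown).
union = FeatureUnion(
    transformer_list=[
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),
        ('spell_chker', Spellingchecker()),
    ],
    n_jobs=-1,
)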

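I do it as follows, but I don't know whether I should extract the features first and then do cross-validation. In this example, X is the list of documents and y is the labels.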

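from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Hold out 20% of the documents for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the feature extractors on the training split only, then apply them to the test split.
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)

# Keep the 7000 highest-scoring features (ANOVA F-test via f_classif, despite the name ch2).
ch2 = SelectKBest(f_classif, k=7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

# Note: gamma has no effect when kernel='linear'.
clf = SVC(C=1, gamma=0.001, kernel='linear', probability=True).fit(X_train, y_train)

print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()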
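
For comparison, one way to avoid choosing the order by hand is to wrap the same steps in a Pipeline and pass that to cross_val_score, so that each fold refits the vectorizer and the selector on its own training portion only. The following is only a minimal sketch under assumptions: the toy corpus stands in for X and y, FeatureExtractor and Spellingchecker are left out because their implementations are not shown above, and k="all" replaces k=7000 purely because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# Toy corpus standing in for X (documents) and y (labels); purely illustrative.
X = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "my kitten likes warm milk",
    "the puppy barked all night",
    "stocks fell sharply today",
    "the market rallied after the report",
    "investors bought more shares",
    "bond yields rose this week",
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    # Feature extraction; the custom transformers from the question
    # (not shown there) would be added to this list.
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
    ])),
    # Feature selection; k="all" is an assumption for the toy data,
    # the question uses k=7000 on a real corpus.
    ("select", SelectKBest(f_classif, k="all")),
    ("clf", SVC(C=1, kernel="linear")),
])

# Each fold calls fit on its training part and score on its held-out part,
# so no vocabulary or feature scores leak from the held-out documents.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)
print("mean accuracy over folds:", scores.mean())
```

With this layout the raw documents go into cross_val_score directly, and the extraction and selection steps are repeated inside every fold rather than once up front.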


2015-09-22 user2161903