2016-02-03 67 views

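I have a set of documents and a set of labels. Right now I am using train_test_split to split my dataset in a 90:10 ratio. However, I would like to use KFold cross-validation instead. How do I split my data into train and test sets with K-fold cross-validation?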

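from sklearn.cross_validation import train_test_split  # sklearn.model_selection in scikit-learn >= 0.18

# Read one whitespace-tokenized document per line
train = []
with open("/Users/rte/Documents/Documents.txt") as f:
    for line in f:
        train.append(line.strip().split())

# Read the labels the same way
labels = []
with open("/Users/rte/Documents/Labels.txt") as t:
    for line in t:
        labels.append(line.strip().split())

# Hold out 10% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(train, labels, test_size=0.1, random_state=42)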

When I try the approach given in the scikit-learn documentation, I get an error:

kf=KFold(len(train), n_folds=3) 

for train_index, test_index in kf: 
    X_train, X_test = train[train_index],train[test_index] 
    y_train, y_test = labels[train_index],labels[test_index] 

Error:

X_train, X_test = train[train_index],train[test_index] 
TypeError: only integer arrays with one element can be converted to an index 

How can I perform 10-fold cross-validation on my documents and labels?


What have you tried to get KFold cross-validation working? Have you seen the example on the [documentation page](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold)?


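Yes, I tried the example from the documentation on my documents and labels, but I get an error: *X_train, X_test = train[train_index], train[test_index] TypeError: only integer arrays with one element can be converted to an index* – minks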

Answer


There are two ways to fix this error:

First way:

Convert your data to NumPy arrays:

import numpy as np 
[...] 
train = np.array(train) 
labels = np.array(labels) 

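Your current code should then work as written, because a NumPy array, unlike a plain Python list, can be indexed with an array of indices such as train_index.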
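For readers on a current scikit-learn, here is a minimal sketch of the NumPy approach applied to 10-fold cross-validation. It assumes the newer sklearn.model_selection module, which replaced sklearn.cross_validation; there, KFold takes n_splits and the index pairs come from its split() method. The toy documents and labels are invented stand-ins for the contents of the two files, and dtype=object is an addition not in the answer: recent NumPy versions refuse to build an array from token lists of different lengths without it.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-ins for the tokenized documents and their labels.
train = [["word"] * (i % 3 + 1) for i in range(20)]  # documents of varying length
labels = [["pos"] if i % 2 == 0 else ["neg"] for i in range(20)]

# dtype=object lets NumPy hold token lists of different lengths.
train_arr = np.array(train, dtype=object)
labels_arr = np.array(labels)

kf = KFold(n_splits=10)
fold_sizes = []
for train_index, test_index in kf.split(train_arr):
    # NumPy arrays accept an array of indices, unlike plain lists.
    X_train, X_test = train_arr[train_index], train_arr[test_index]
    y_train, y_test = labels_arr[train_index], labels_arr[test_index]
    fold_sizes.append((len(X_train), len(X_test)))

print(fold_sizes)
```

With 20 samples and 10 folds, each iteration trains on 18 documents and tests on 2. Passing shuffle=True and a random_state to KFold would randomize which documents land in each fold.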

Second way:

Use list comprehensions to index the train and labels lists with the train_index and test_index lists:

for train_index, test_index in kf: 
    X_train, X_test = [train[i] for i in train_index],[train[j] for j in test_index] 
    y_train, y_test = [labels[i] for i in train_index],[labels[j] for j in test_index] 

(For this solution, see also the related question "index list with another list".)
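To show the list-comprehension approach end to end without any third-party dependency, the sketch below pairs it with a small index generator. The helper kfold_indices is hypothetical and not part of scikit-learn; it only imitates contiguous, unshuffled K-fold splitting, and the toy data is invented for illustration.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_index, test_index) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_index = list(range(start, stop))
        train_index = list(range(0, start)) + list(range(stop, n_samples))
        yield train_index, test_index
        start = stop


# Hypothetical stand-ins for the tokenized documents and their labels.
train = [["doc%d" % i, "token"] for i in range(25)]
labels = [["label%d" % (i % 2)] for i in range(25)]

folds = []
for train_index, test_index in kfold_indices(len(train), 10):
    # Plain lists are indexed one position at a time via comprehensions.
    X_train = [train[i] for i in train_index]
    X_test = [train[j] for j in test_index]
    y_train = [labels[i] for i in train_index]
    y_test = [labels[j] for j in test_index]
    folds.append((len(X_train), len(X_test)))

print(folds)
```

Because the data stays in ordinary lists, this works for documents of any length and needs no conversion step, at the cost of slower indexing on large datasets.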