2016-02-09

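I was previously using cross_validation.train_test_split to split my dataset with a specific test size, in a 90:10 train/test ratio. I am now moving to stratified shuffle splitting (StratifiedShuffleSplit, which in scikit-learn is a merge of KFold and ShuffleSplit). Is it better to do the stratified split with a specified test size, or should I do it without specifying the test size?

Here is what I am doing: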

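import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

# Read the documents: one whitespace-tokenized document per line
train = []
with open("/Users/minks/Documents/documents.txt") as f:
    for line in f:
        train.append(line.strip().split())
train = np.array(train)

# Read the labels into a flat array
labels = []
with open("/Users/minks/Documents/Labels.txt") as t:
    for line in t:
        labels.extend(line.strip().split())
labels = np.array(labels)

# 5 stratified shuffles, each holding out 10% of the samples for testing
kf = StratifiedShuffleSplit(labels, n_iter=5, test_size=0.10)

for train_index, test_index in kf:
    X_train, X_test = train[train_index], train[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]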

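I would like to know whether specifying test_size is a good decision performance-wise, because as far as I understand, if I leave it out a random ratio is picked.

Answer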


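If you do not specify a test size yourself, it defaults to 0.1. It does not use a random ratio. You can see the default value in the documentation (the function's docstring):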

In IPython, run:

In [1]: from sklearn.cross_validation import StratifiedShuffleSplit
In [2]: StratifiedShuffleSplit?

You will see:

[...] 
Parameters 
---------- 
n : int 
    Total number of elements in the dataset. 

n_iter : int (default 10) 
    Number of re-shuffling & splitting iterations. 

test_size : float (default 0.1), int, or None 
    If float, should be between 0.0 and 1.0 and represent the 
    proportion of the dataset to include in the test split. If 
    int, represents the absolute number of test samples. If None, 
    the value is automatically set to the complement of the train size. 

[...]
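
The same behaviour can be checked directly rather than read from the docstring. The sketch below is only an illustration and makes a few assumptions that are not in the original post: it uses the newer sklearn.model_selection API (where the splitter takes n_splits and the data is passed to split(), instead of the cross_validation module shown above), and X and y are made-up toy data with an 80:20 class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 100 samples, 80 of class "a" and 20 of class "b".
X = np.arange(100).reshape(-1, 1)
y = np.array(["a"] * 80 + ["b"] * 20)

# test_size left unset: the splitter falls back to a 10% test split.
default_split = StratifiedShuffleSplit(n_splits=5, random_state=0)

# test_size set explicitly to the same value.
explicit_split = StratifiedShuffleSplit(n_splits=5, test_size=0.10, random_state=0)

# Size of the test fold in every iteration, for both splitters.
default_sizes = [len(test_idx) for _, test_idx in default_split.split(X, y)]
explicit_sizes = [len(test_idx) for _, test_idx in explicit_split.split(X, y)]

# Number of class "b" samples in each test fold, to check that the
# 80:20 class ratio is preserved by the stratification.
minority_counts = [
    int((y[test_idx] == "b").sum())
    for _, test_idx in explicit_split.split(X, y)
]

print(default_sizes)
print(explicit_sizes)
print(minority_counts)
```

On this toy data, both splitters should hold out 10 samples per iteration, 2 of them from the minority class, so passing test_size=0.10 explicitly mainly documents the intent rather than changing the result.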