2016-11-17 55 views

IndexError: positional indexers are out-of-bounds with sklearn train_test_split

I am using a pandas DataFrame with train_test_split from sklearn's cross_validation module.

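import numpy as np
import pandas
import sklearn.cross_validation  # makes sklearn.cross_validation.train_test_split resolvable

# 300 rows: a random feature 'a' and a binary label 'c' (100 ones, then 200 zeros)
d = pandas.DataFrame({'a': np.random.randn(300),
                      'c': np.concatenate([np.ones(100), np.zeros(200)])})
(X, y) = (d['a'], d['c'])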

This works:

X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0) 
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0) 

Why doesn't this work?

X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0,stratify=y) 
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0,stratify=y) 

in _is_valid_list_like(self, key, axis) 
    1536   l = len(ax) 
    1537   if len(arr) and (arr.max() >= l or arr.min() < -l): 
-> 1538    raise IndexError("positional indexers are out-of-bounds") 
    1539 
    1540   return True 

IndexError: positional indexers are out-of-bounds 

Answer

TL;DR: In your second call to train_test_split, the array you pass as stratify has a different length from the y you are splitting. Use stratify=y_train_and_cv.
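
To see why a length mismatch surfaces as this particular IndexError, here is a minimal sketch that does not use scikit-learn at all. It assumes (this is an assumption about the old cross_validation code path, not something shown in the traceback) that the row positions are derived from the 300-element stratify array and then applied to the 240-row Series. The names shorter and positions_from_longer_array are made up for illustration.

```python
import pandas as pd

# Stand-in for X_train_and_cv: only 240 rows, so valid positions are 0..239.
shorter = pd.Series(range(240))

# Hypothetical positions derived from a 300-element array (e.g. the full y).
positions_from_longer_array = [10, 150, 299]

try:
    shorter.iloc[positions_from_longer_array]
    raised = False
except IndexError as err:
    # pandas rejects position 299 because the Series has only 240 rows.
    raised = True
    print("IndexError:", err)
```

Any position of 240 or above is enough to trigger the error, which is why the first split (where both arrays have 300 elements) goes through and only the second one fails.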


First, a small side note: cross_validation (0.17.1 documentation) will soon be removed, and you should use model_selection.train_test_split (0.18.1) instead. I'll import train_test_split itself to shorten the lines that follow:

# Same as this in older versions: 
# from sklearn.cross_validation import train_test_split 
from sklearn.model_selection import train_test_split 

This is fine:

X_train_and_cv, X_test,y_train_and_cv,y_test = train_test_split(X,y, 
                   test_size=0.2, 
                   random_state=0, 
                   stratify=y) 

This is not fine, because y=y_train_and_cv (len = 240) while stratify=y (len = 300):

X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv, 
               y_train_and_cv, 
               test_size=0.2, 
               random_state=0, 
               stratify=y) 

Fix it by replacing stratify=y with stratify=y_train_and_cv:

X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv, 
               y_train_and_cv, 
               test_size=0.2, 
               random_state=0, 
               stratify=y_train_and_cv) 
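
Putting the pieces together, the corrected two-step split can be sketched end to end as follows. This assumes scikit-learn 0.18 or later (for model_selection) and reuses the data from the question; the final loop that prints sizes and class proportions is an addition for checking the result, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same data as in the question: 100 ones followed by 200 zeros as labels.
d = pd.DataFrame({'a': np.random.randn(300),
                  'c': np.concatenate([np.ones(100), np.zeros(200)])})
X, y = d['a'], d['c']

# First split: X and y both have 300 rows, so stratify=y lines up.
X_train_and_cv, X_test, y_train_and_cv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Second split: stratify on the labels that belong to the data being split.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_and_cv, y_train_and_cv,
    test_size=0.2, random_state=0, stratify=y_train_and_cv)

# Report the size and the share of ones in each part.
for name, part in [('train', y_train), ('cv', y_cv), ('test', y_test)]:
    print(name, len(part), part.mean())
```

A quick sanity check before any stratified call is that len(stratify) equals len(X) for that call. What happens when they differ depends on the scikit-learn version, so it is safer to compare the lengths yourself than to rely on a particular exception.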

Wow, I now realize I was reading 'y' as a string rather than a variable, e.g. stratify='yes', and assuming the array to stratify on was inferred from the second argument. – user86895

Ah! That would be stratify=True :) –
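
As far as I know, train_test_split has no such stratify=True switch: stratify expects the label array itself. If you want the "stratify on whatever y I passed" behaviour, a thin wrapper is one way to get it. The sketch below is only an illustration; stratified_split is a made-up name, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(X, y, **kwargs):
    """Hypothetical helper: split X and y, always stratifying on the y passed in."""
    if 'stratify' in kwargs:
        raise TypeError("stratify is set automatically from y")
    return train_test_split(X, y, stratify=y, **kwargs)


X = np.random.randn(300)
y = np.concatenate([np.ones(100), np.zeros(200)])

# Each call stratifies on its own y, so the lengths cannot get out of sync.
X_rest, X_test, y_rest, y_test = stratified_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_cv, y_train, y_cv = stratified_split(
    X_rest, y_rest, test_size=0.2, random_state=0)
```

Because the wrapper never accepts a separate stratify argument, the mistake from the question (passing the full-length y to the second call) cannot occur.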