Python: splitting data into random sets

I want to split my data into two random sets. I have done the first part:

ind = np.random.choice(df.shape[0], size=[int(df.shape[0]*0.7)], replace=False) 
X_train = df.iloc[ind]

Now I want to select all the indices that are not in ind to create my test set. Can you tell me how to do that?

I thought it would be

X_test = df.iloc[-ind]

but apparently it isn't.


2017-05-29 jlt199

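So you want to select 70% as training data and the remaining 30% as test data? A simpler approach might be to use np.random.shuffle to shuffle the indices, then use the first 70% of the shuffled indices for training and the rest for testing.

Yes, that's exactly what I want – jlt199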

Try this pure-Python approach.

ind_inversed = list(set(range(df.shape[0])) - set(ind)) 
X_test = df.iloc[ind_inversed]
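
The same complement can also be sketched with a NumPy boolean mask, which keeps the test rows in their original order. This is only an illustration: the toy df below is a hypothetical stand-in for the asker's data, and mask and ind_test are names introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the question's df.
df = pd.DataFrame({"value": np.arange(10) * 10})

n_rows = df.shape[0]
ind = np.random.choice(n_rows, size=int(n_rows * 0.7), replace=False)

# Boolean mask: True for every row that was NOT drawn into the training set.
mask = np.ones(n_rows, dtype=bool)
mask[ind] = False

X_train = df.iloc[ind]
X_test = df.iloc[mask]

# The same complement expressed as sorted integer positions.
ind_test = np.setdiff1d(np.arange(n_rows), ind)

# Note: -ind only negates the positions, and iloc counts negative positions
# from the end, so df.iloc[-ind] selects 7 rows again rather than the other 3.
```

Either mask or ind_test can be passed to df.iloc; the mask form avoids building Python sets for large frames.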


2017-05-29 15:48:07

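This doesn't randomize the two sets.

It does, because I assume 'ind' is computed the same way as in the original question. 'ind_inversed' holds all the other indices that are not in 'ind'.

You're right, sorry!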

Check out scikit-learn's train_test_split().

Example from the documentation:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

In your case, you could do it like this:

larger, smaller = train_test_split(df, test_size=0.3)
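
As a rough sketch of that call on a DataFrame: train_test_split accepts a pandas DataFrame directly and hands back DataFrames, so no conversion is needed for inspection afterwards. The toy df and the random_state value are assumptions made for the example, not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the question's df.
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

# random_state is fixed here only so that the split is repeatable.
X_train, X_test = train_test_split(df, test_size=0.3, random_state=0)

# Both parts are still DataFrames and keep their original row labels.
print(X_train.shape, X_test.shape)
```

Dropping random_state gives a different split on every run, which matches the behaviour of the np.random.choice version in the question.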


2017-05-29 15:49:16

Another way to get a 70-30 train-test split is to generate the indices, shuffle them randomly, and then split them 70-30.

ind = np.arange(df.shape[0]) 
np.random.shuffle(ind) 
X_train = df.iloc[ind[:int(0.7*df.shape[0])],:] 
X_test = df.iloc[ind[int(0.7*df.shape[0]):],:]

I would suggest converting the pandas.DataFrame to a numeric matrix and using scikit-learn's train_test_split to do the split, unless you really want to do it this way.
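
If you do want to stay with the shuffle-and-slice approach, it could be wrapped in a small helper along these lines. The function name shuffle_split, its parameters, and the use of a seeded np.random.default_rng generator are choices made for this sketch, not something taken from the answer.

```python
import numpy as np
import pandas as pd


def shuffle_split(df, train_frac=0.7, seed=None):
    """Return (train, test) frames built from a random permutation of row positions."""
    rng = np.random.default_rng(seed)
    positions = rng.permutation(len(df))   # shuffled 0 .. len(df) - 1
    cut = int(train_frac * len(df))        # number of training rows
    return df.iloc[positions[:cut]], df.iloc[positions[cut:]]


# Hypothetical toy frame standing in for the question's df.
df = pd.DataFrame({"value": np.arange(10)})
X_train, X_test = shuffle_split(df, seed=42)
```

Passing the same seed reproduces the same split; leaving it as None draws a fresh permutation each time.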


2017-05-29 15:54:38

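I like this approach. Thanks. I have used 'train_test_split' before (although I had forgotten about it), but I find the data easier to inspect and visualize in a DataFrame. – jlt199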
