Random sampling to create cross-validation and training sets based on the label

I am trying to draw random samples from my data in a 60:20:20 ratio to create training, cross-validation and test sets.

I am using the following code:

train=data.sample(frac=0.6) 
trcv=data.drop(train.index) 
test=trcv.sample(frac=0.5) 
cv=trcv.drop(test.index)
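A minimal sketch of the same 60:20:20 split on a made-up frame, useful for checking the resulting sizes. The toy `data`, its `x` column, and the `random_state` seeds are assumptions added for illustration; note that `drop(train.index)` relies on the index being unique.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the contents are invented for illustration.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "x": rng.normal(size=1000),
    "Y": rng.integers(0, 2, size=1000),
})

train = data.sample(frac=0.6, random_state=0)  # 60% of all rows
trcv = data.drop(train.index)                  # the remaining 40%
test = trcv.sample(frac=0.5, random_state=0)   # half of the remainder: 20%
cv = trcv.drop(test.index)                     # the other half: 20%

print(len(train), len(cv), len(test))  # 600 200 200
```

Fixing `random_state` makes the split repeatable between runs; leaving it out gives a fresh random split each time, as in the code above.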

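But then I realized that my data is for supervised learning, and the last column of the dataframe, Y (the column name), holds the label: 1 or 0.

I want to build the training, test and cross-validation sets so that samples with y = 0 and y = 1 appear in the training set in a .99:.01 ratio. That is, if the training set has 100 records, I want 99 of them to have y = 0 and only one to have y = 1.

The remaining 99% of the records where y = 1 need to be split between the cross-validation and test sets as 45% and 44%.

One possible way to do this is to create a dataframe holding a copy of the records whose Y column is 1, and then delete all records with y = 1 from the main dataframe.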

Y1=data[data.iloc[:,8]==1] 
data=data[data.iloc[:,8]!=1]

Then apply the sampling scheme above to get the cv, test and training sets.

train=data.sample(frac=0.6) 
trcv=data.drop(train.index) 
test=trcv.sample(frac=0.5) 
cv=trcv.drop(test.index)

Now sample 0.1:0.44:0.45 from the dataframe where y = 1:

ycvT=Y1.sample(frac=0.99) 
ytr=Y1.drop(ycvT.index) 
ytest= ycvT.sample(frac=0.45) 
ycv= ycvT.drop(ytest.index)

This produces 3 different dataframes containing only y = 1 records.

Now I can add them to the training, cross-validation and test sets.
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
train=pd.concat([train, ytr]) 
train=train.sample(frac=1).reset_index(drop=True)

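...and likewise for the cv and test sets.

I would like to know whether there is a smarter (shorter) way to do this. I want to limit myself to pandas, numpy and scipy.

Any hints? Thanks.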


2017-08-09 sunny

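import pandas as pd

# Group the rows by the label stored in the last column (0 or 1)
y = data.iloc[:, -1].values 
g = data.groupby(y) 

frac = .2  # fraction of the y == 1 rows to draw

ones = g.get_group(1).sample(frac=frac) 
# Draw 99 rows with y == 0 for every sampled row with y == 1 (a 99:1 ratio)
zero = g.get_group(0).sample(len(ones) * 99) 

# Combine both groups and shuffle the rows
train = pd.concat([ones, zero]).sample(frac=1)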
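Building on the grouping above, here is a hedged sketch of one way the rows left over after building `train` could be turned into cv and test sets. The synthetic `data`, the seeds, `frac = 0.01`, and the even cv/test split are all assumptions made for illustration; the question's own 45%/44% targets are not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `data`: 20,000 rows, roughly 5% of them with Y == 1.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "feature": rng.normal(size=20_000),
    "Y": (rng.random(20_000) < 0.05).astype(int),
})

g = data.groupby(data.iloc[:, -1].values)

frac = 0.01  # assumed share of the y == 1 rows that goes into training
ones = g.get_group(1).sample(frac=frac, random_state=0)
zero = g.get_group(0).sample(n=len(ones) * 99, random_state=0)
train = pd.concat([ones, zero]).sample(frac=1, random_state=0)

# Shuffle everything that is not in train, then cut it in half.
rest = data.drop(train.index).sample(frac=1, random_state=0)
half = len(rest) // 2
cv, test = rest.iloc[:half], rest.iloc[half:]

print(train["Y"].value_counts())
print(len(train), len(cv), len(test))
```

Because `rest` is shuffled as a whole, cv and test inherit whatever class balance remains after the training rows are removed; if each needs its own ratio, the same per-group sampling can be repeated on `rest`.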


2017-08-09 01:12:06 piRSquared

That's awesome! Should frac = 0.1? – sunny

@sunny frac is whatever fraction you want to pull – piRSquared
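To make the effect of `frac` concrete, here is a small arithmetic sketch. The helper `rows_drawn` and the row counts are hypothetical, and it assumes `sample(frac=...)` rounds `frac * n_ones` to the nearest integer. The point is that `frac` fixes how many y = 1 rows are pulled, the recipe then needs 99 times as many y = 0 rows, and sampling fails if that many are not available.

```python
def rows_drawn(frac, n_ones, n_zeros, zeros_per_one=99):
    """Return (ones, zeros) that the recipe above would draw for `frac`.

    Hypothetical helper: assumes the number of sampled y == 1 rows is
    frac * n_ones rounded to the nearest integer.
    """
    ones = round(frac * n_ones)
    zeros = ones * zeros_per_one
    if zeros > n_zeros:
        raise ValueError(
            f"need {zeros} rows with y == 0, only {n_zeros} available"
        )
    return ones, zeros


# Made-up counts: 1,000 rows with y == 1 and 50,000 rows with y == 0.
print(rows_drawn(0.2, 1_000, 50_000))  # (200, 19800)
print(rows_drawn(0.5, 1_000, 50_000))  # (500, 49500)
```

With these made-up counts, any `frac` above roughly 0.505 would ask for more y = 0 rows than exist, so the usable range of `frac` depends on how imbalanced the data is.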
