2015-04-02 25 views
1

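Split a dataframe into 10 equal parts and, in a loop, pick one part each time and merge the other 9

I need to split a dataframe into 10 parts, then use one part as the test set and the remaining 9 (merged together) as the training set. I found the code below, which lets me split the dataset, and after selecting one part I try to merge the rest. The first iteration works fine, but on the second iteration I get the error below.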

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), index=list(range(10)))

for x in range(3):
    dfList = np.array_split(df, 3)
    testdf = dfList[x]
    dfList.remove(dfList[x])
    print(testdf)
    traindf = pd.concat(dfList)
    print(traindf)
    print("================================================")

[Image: screenshot of the error output; not available]

+0

Why not use scikit-learn's cross-validation? http://scikit-learn.org/stable/modules/cross_validation.html#random-permutations-cross-validation-a-k-a-shuffle-split – 2015-04-02 03:12:02

+0

I'm doing this as part of a course, and the assignment is to implement the validation myself. – 2015-04-02 03:14:29

Answers

0

OK, I got it working like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), index=list(range(10)))

dfList = np.array_split(df, 3)
for x in range(3):
    trainList = []
    for y in range(3):
        if y == x:
            # the x-th part is the test set
            testdf = dfList[y]
        else:
            # every other part goes into the training set
            trainList.append(dfList[y])
    traindf = pd.concat(trainList)
    print(testdf)
    print(traindf)
    print("================================================")

But a better approach would be welcome.
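A more compact variant of the loop above, offered only as a sketch. The error in the question is likely caused by `dfList.remove(...)`, because `list.remove` compares elements with `==`, and comparing two DataFrames does not give a single True/False. Selecting the other parts by position sidesteps that comparison. The names `n_parts`, `chunks`, and `folds` are made up for this example, and splitting row positions (instead of the frame itself) is a choice made here, not part of the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), index=list(range(10)))

n_parts = 3  # hypothetical name; same number of parts as in the loop above

# Split the row positions 0..len(df)-1 into n_parts roughly equal chunks.
chunks = np.array_split(np.arange(len(df)), n_parts)

folds = []
for i in range(n_parts):
    # The i-th chunk is the test set.
    test_df = df.iloc[chunks[i]]
    # All other chunks, joined by list slicing, form the training set.
    train_pos = np.concatenate(chunks[:i] + chunks[i + 1:])
    train_df = df.iloc[train_pos]
    folds.append((train_df, test_df))
```

Each entry of `folds` is a `(train_df, test_df)` pair, so the printing from the loop above can be done by iterating over `folds`.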


1

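I don't think you need to split the dataframe into 10 parts, just into 2. I use this code to split a dataframe into a training set and a validation set: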

test_index = np.random.choice(df.index, int(len(df.index) / 10), replace=False)

test_df = df.loc[test_index]

train_df = df.loc[~df.index.isin(test_index)]
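For comparison, the same one-off 1:9 split can probably be written with `DataFrame.sample` and `DataFrame.drop`. This is a sketch, not the answerer's code: it assumes the index labels are unique, and the 20-row frame and the `random_state=0` seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Example data; any dataframe with unique index labels would do.
df = pd.DataFrame(np.random.randn(20, 4))

# Draw 10% of the rows at random, without replacement, as the test set.
# random_state is an arbitrary seed that makes the draw repeatable.
test_df = df.sample(frac=0.1, random_state=0)

# Everything that was not drawn becomes the training set.
train_df = df.drop(test_df.index)
```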

+0

This is a better solution. – 2015-04-02 14:38:48

+0

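@Haleemur Ali --- This is fine if I need a single 1:9 split, since it randomly selects 1/10 of the rows as the test set. However, I am trying to implement k-fold validation, and as I understand it: you break the data into K chunks. Then, for k = 1 to K, you take the k-th chunk as the test chunk and the rest of the data becomes the training data. Train, test, record the result, and move on to the next k. – 2015-04-02 18:51:05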

+0

You can split the dataframe index into chunks and loop over them. See http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python – Spas 2015-04-03 13:19:13
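The k-fold procedure described in these comments (chunk the index, then loop with each chunk as the test set) might look roughly like the sketch below. `k_fold_splits` is a hypothetical helper name, the index labels are assumed to be unique, and the recording step is only a placeholder for real training and testing.

```python
import numpy as np
import pandas as pd


def k_fold_splits(df, k):
    """Yield (train_df, test_df) pairs, one per fold (hypothetical helper)."""
    # Split the index labels into k roughly equal chunks.
    index_chunks = np.array_split(df.index.to_numpy(), k)
    for test_labels in index_chunks:
        test_df = df.loc[test_labels]
        # Dropping the test labels leaves the other k-1 chunks.
        train_df = df.drop(test_labels)
        yield train_df, test_df


df = pd.DataFrame(np.random.randn(10, 4))

results = []
for train_df, test_df in k_fold_splits(df, 5):
    # Train and test a model here; this sketch only records the sizes.
    results.append((len(train_df), len(test_df)))
```

The rows are not shuffled here, so the folds follow the original row order; shuffle the index first if the rows are sorted in some meaningful way.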

0

You can use the permutation function from numpy.random:

import numpy as np 
import pandas as pd 
import math as mt 
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] 
df = pd.DataFrame({'a': l, 'b': l}) 

Shuffle the dataframe index:

shuffled_idx = np.random.permutation(df.index)  

Divide shuffled_idx into N equal(ish) parts.
For this example, let N = 4:

N = 4
n = len(shuffled_idx) / N  # fractional part size (true division, Python 3)
parts = []
for j in range(N):
    parts.append(shuffled_idx[mt.ceil(j*n): mt.ceil(j*n+n)])

# to show each shuffled part of the data frame
# (parts hold index labels, so select with .loc)
for k in parts:
    print(df.loc[k])
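As a possible shortcut for the ceil arithmetic above, `np.array_split` also accepts a length that does not divide evenly. This sketch reuses the 13-row example and only checks the part sizes; `sizes` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

l = list(range(1, 14))
df = pd.DataFrame({'a': l, 'b': l})

N = 4
# Shuffle the index labels, then let NumPy cut them into N parts.
shuffled_idx = np.random.permutation(df.index.to_numpy())
parts = np.array_split(shuffled_idx, N)

# 13 rows in 4 parts: the first part gets the extra row.
sizes = [len(p) for p in parts]
```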
0

I wrote a script (find/fork it on GitHub) for randomly splitting a pandas dataframe. Here's a link to the pandas merge, join, and concatenate functionality, for reference.

The same code:

import pandas as pd
import numpy as np

# older xlwings API (Sheet/Range/Workbook classes), as in the original script
from xlwings import Sheet, Range, Workbook

# path to file
df = pd.read_excel(r"//PATH TO FILE//")

df.columns = [c.replace(' ', "_") for c in df.columns]
x = df.columns[0]  # name of the column whose values are shuffled and split

# number of parts the data frame or the list needs to be split into
n = 7
seq = list(df[x])
np.random.shuffle(seq)
lists1 = [seq[i:i+n] for i in range(0, len(seq), n)]
listsdf = pd.DataFrame(lists1).reset_index()

dataframesDict = dict()

# calling xlwings workbook function
Workbook()

for i in range(0, n):
    if Sheet.count() < n:
        Sheet.add()
    # rows whose value in column x belongs to the i-th part
    dataframesDict[i] = df.loc[df[x].isin(list(listsdf[listsdf.columns[i+1]]))]
    Range(i, "A1").value = dataframesDict[i]
0

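It looks like you are trying to do a k-fold type of thing rather than a one-off split. This code should help. scikit-learn's k-fold functionality may also fit your case and is worth checking out.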

import numpy as np
import pandas as pd

# Split dataframe by rows into n roughly equal portions and return list of
# them.
def splitDf(df, n):
    splitPoints = [int(x * len(df) / n) for x in range(1, n)]
    # df.sample(frac=1) shuffles the rows before they are split
    splits = list(np.split(df.sample(frac=1), splitPoints))
    return splits

# Take splits from splitDf, and return into test set (splits[index]) and training set (the rest) 
def makeTrainAndTest(splits, index) : 
    # index is zero based, so range 0-9 for 10 fold split 
    test = splits[index] 

    leftLst = splits[:index] 
    rightLst = splits[index+1:] 

    train = pd.concat(leftLst+rightLst) 

    return train, test 

You can then use these functions to make the folds:

df = <my_total_data> 
n = 10 
splits = splitDf(df, n) 
trainTest = [] 
for i in range(0,n) : 
    trainTest.append(makeTrainAndTest(splits, i)) 

# Get test set 2
test2 = trainTest[2][1]

# Get training set zero 
train0 = trainTest[0][0] 
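For the scikit-learn route mentioned at the top of this answer, a minimal sketch using `KFold` from `sklearn.model_selection` could look like this. It assumes scikit-learn is installed; the 20-row frame and `random_state=0` are arbitrary, and `folds` is a name introduced for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame(np.random.randn(20, 4))

# shuffle=True randomizes which rows land in which fold;
# random_state is an arbitrary seed for repeatability.
kf = KFold(n_splits=10, shuffle=True, random_state=0)

folds = []
for train_pos, test_pos in kf.split(df):
    # KFold yields row positions, so select with .iloc rather than .loc.
    folds.append((df.iloc[train_pos], df.iloc[test_pos]))
```

As with `trainTest` above, `folds[i]` holds the `(train, test)` pair for fold `i`.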