2014-01-14 72 views
3

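Using pandas to bin values into groups of a minimum size

I want to bin a sample of observations into n discrete groups, then combine those groups until each subgroup has a minimum of 6 members. So far, I've generated the bins and grouped my DataFrame into them: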

# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21) 
grp = df.groupby(np.digitize(df.heights, bins)) 
grp.size() 

1  4 
2  1 
3  2 
4  3 
5  2 
6  8 
7  7 
8  6 
9  19 
10 12 
11 13 
12 12 
13  7 
14 12 
15 12 
16  2 
17  3 
18  6 
19  3 
21  1 

So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.

Answers

2

You can do something like this:

import numpy as np
import pandas as pd

# random_integers is deprecated; randint's upper bound is exclusive, hence 201
df = pd.DataFrame(np.random.randint(1, 201, 135), columns=['heights'])
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()

def f(vals, limit):
    # yield one group label per bin; start a new group when the running total would exceed limit
    total = 0
    group = 1
    for v in vals:
        total += v
        if total <= limit:
            yield group
        else:
            group += 1
            total = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])

If you run print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()) you will see that the cumulative sums are grouped as expected.
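Note that f caps each group at a maximum total, whereas the question asks for a minimum of 6 per group. A minimal sketch of a minimum-size variant follows, run on the bin sizes posted in the question. The names label_min_size and min_size are hypothetical, and folding an undersized trailing group into the previous one is an assumption about the desired behaviour, not something the thread specifies.

```python
def label_min_size(sizes, min_size):
    """Label consecutive bins so that each merged group has >= min_size members."""
    labels = []
    group = 1
    total = 0
    for n in sizes:
        labels.append(group)
        total += n
        if total >= min_size:  # current group is large enough, start a new one
            group += 1
            total = 0
    # assumption: a trailing group that never reached min_size joins the previous group
    if total > 0 and group > 1:
        labels = [min(g, group - 1) for g in labels]
    return labels


# bin sizes as posted in the question (bins 1-19 and 21)
sizes = [4, 1, 2, 3, 2, 8, 7, 6, 19, 12, 13, 12, 7, 12, 12, 2, 3, 6, 3, 1]
labels = label_min_size(sizes, 6)

# total membership of each merged group
totals = {}
for g, n in zip(labels, sizes):
    totals[g] = totals.get(g, 0) + n

print(labels)
print(totals)
```

This is a greedy left-to-right pass, so it will not necessarily reproduce the exact merges listed in the question; it only guarantees that every resulting group meets the minimum whenever the overall total does.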

Additionally, if you want to group the original values, you can do something like this:

dat = np.random.randint(0, 201, 135)  # placeholder; overwritten by the fixed sample below
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134, 
83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166, 
1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81, 
56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64, 
143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162, 
46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175, 
184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158, 
83,155,161,29,197,143,122,72,60]) 
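df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
# group_keys=False keeps the original row index on the result of apply()
grp = df.heights.groupby(bins, group_keys=False)

m = 15  # you should put 6 here, the minimum
s = 0   # rows accumulated in the current subgroup
c = 1   # label of the current subgroup

def f(x):
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:  # close the subgroup once it holds more than m rows
        s = 0
        c += 1
    return res

g = grp.apply(f)
print(df.groupby(g).size())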

# another way of doing the same, just a matter of taste

m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
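
Both variants above keep their running state in globals, so they depend on pandas calling the function exactly once per group, in bin order. A possible alternative, sketched here under assumptions, is to decide the merged label per bin from the bin sizes first and then map it back onto the rows. The data is a synthetic stand-in, and bin_to_group, min_size and merged are hypothetical names.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the asker's data (assumption: 135 values in [0, 200])
rng = np.random.default_rng(0)
df = pd.DataFrame({"heights": rng.integers(0, 201, 135)})

bins = np.digitize(df["heights"], np.linspace(0, 200, 21))
sizes = df.groupby(bins).size()  # rows per non-empty bin, ordered by bin number

min_size = 6
bin_to_group = {}
group, total = 1, 0
for bin_id, n in sizes.items():
    bin_to_group[bin_id] = group
    total += n
    if total >= min_size:  # group is large enough, start the next one
        group, total = group + 1, 0

# assumption: an undersized trailing group is folded into the previous group
if total > 0 and group > 1:
    bin_to_group = {b: min(g, group - 1) for b, g in bin_to_group.items()}

# translate each row's bin number into its merged group label
merged = pd.Series(bins, index=df.index).map(bin_to_group)
result = df.groupby(merged).size()
print(result)
```

Because the labels are computed from the sizes Series rather than inside apply or transform, the result does not depend on how pandas iterates over the groups.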
+0

This looks perfect. What does the 'sizes' var in the listcomp refer to, though? – urschrei

+1

Oh, sorry, I've updated the code. I've also posted another approach in case you want to group the original data. Sorry, I was offline. –

+0

Hmm, your second example gives an AttributeError: ''DataFrame' object has no attribute 'size'' where 'grp.apply(f)' is called. – urschrei