2017-08-14 41 views

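Repeated operations on rows with the same index in a pandas DataFrame

I am intentionally using a DataFrame with a non-unique index. I want to operate on the rows that share a key, as shown below. For each unique key, I want to add the first 'other_numbers' value for that key to every 'numbers' value. Is this possible without splitting the DataFrame or other time-consuming operations?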

import pandas as pd 


d = {'key':['a', 'a', 'b','b'], 
    'numbers':[10,20,30,40], 
    'other_numbers':[1,2,3,4] 
    } 

df = pd.DataFrame(data=d) 
df = df.set_index('key') 

print(df)

## desired result:
##  numbers other_numbers new 
## key 
## a  10    1  11 
## a  20    2  21 
## b  30    3  33 
## b  40    4  43 

Answers


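You can use duplicated to flag every repeated index entry, then mask to replace the other_numbers values at those positions with NaN, so that only the first value for each key remains. The NaNs are then filled by ffill (fillna with method='ffill'). This relies on rows with the same key being adjacent, as they are here: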

df['new'] = df['numbers'] + df['other_numbers'].mask(df.index.duplicated()).ffill().astype(int) 
print (df) 
    numbers other_numbers new 
key        
a   10    1 11 
a   20    2 21 
b   30    3 33 
b   40    4 43 
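
To make the chained expression above easier to follow, here is a minimal sketch that splits it into its intermediate steps on the same sample data. The names is_repeat, first_only and first_filled are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Same sample data as in the question, with 'key' as a non-unique index
df = pd.DataFrame(
    {"numbers": [10, 20, 30, 40], "other_numbers": [1, 2, 3, 4]},
    index=pd.Index(["a", "a", "b", "b"], name="key"),
)

# True for every occurrence of a key after its first one
is_repeat = df.index.duplicated()

# Replace the repeated positions with NaN, keeping only the first value per key
# (the column becomes float because of the NaNs)
first_only = df["other_numbers"].mask(is_repeat)

# Carry each remaining first value forward over the NaNs, then restore ints
first_filled = first_only.ffill().astype(int)

df["new"] = df["numbers"] + first_filled
print(df)
```

The forward fill is what ties each row to the first value of its own key, which is why the rows for one key have to sit next to each other.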

Timings

import numpy as np

np.random.seed(123) 

N = 1000000 

df = pd.DataFrame({'numbers': np.random.randint(20,size=N), 
        'other_numbers': np.random.randint(10,size=N)}, 
        index=np.random.randint(20000,size=N)).sort_index() 
df.index.name = 'key' 
print (df) 

In [83]: %timeit df['new'] = df['numbers'] + df['other_numbers'].mask(df.index.duplicated()).ffill().astype(int) 
10 loops, best of 3: 34.8 ms per loop 

In [84]: %timeit df.assign(new1=df.groupby('key')['other_numbers'].transform('first')+df['numbers']) 
10 loops, best of 3: 64.7 ms per loop 

One way to do it:

In [28]: df.assign(new=df.groupby('key')['other_numbers'].transform('first')+df['numbers']) 
Out[28]: 
    numbers other_numbers new 
key 
a   10    1 11 
a   20    2 21 
b   30    3 33 
b   40    4 43
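

As a rough illustration of how the two approaches differ, the sketch below runs both on a frame whose keys are interleaved rather than grouped together. The data is made up for this example and is not from the question. The groupby version looks up the first value per key regardless of row order, whereas the mask/ffill version assumes rows with the same key are adjacent.

```python
import pandas as pd

# Hypothetical data: same keys as the question, but interleaved (a, b, a, b)
df = pd.DataFrame(
    {"numbers": [10, 30, 20, 40], "other_numbers": [1, 3, 2, 4]},
    index=pd.Index(["a", "b", "a", "b"], name="key"),
)

# groupby/transform: the first other_numbers value of each key, per row
via_groupby = df["numbers"] + df.groupby(level="key")["other_numbers"].transform("first")

# mask/ffill: forward-fills across key boundaries when keys are not adjacent
via_ffill = df["numbers"] + df["other_numbers"].mask(df.index.duplicated()).ffill().astype(int)

print(pd.DataFrame({"via_groupby": via_groupby, "via_ffill": via_ffill}))
```

In this unsorted case the third row (key a) gets 21 from the groupby version but 23 from the mask/ffill version, because the forward fill picks up the preceding b value. Sorting the index first, as the timing setup does with sort_index(), avoids this.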