2017-05-17 115 views
2

pandas: How to get the first positive number after grouping by a column?

I have a pandas DataFrame like this:

 a b id 
1 10 6 1 
2  6 -3 1 
3 -3 12 1 # First time id 1 has a b value over 10 
4  4 23 2 # First time id 2 has a b value over 10 
5 12 11 2 
6  3 -5 2 

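How can I build a new DataFrame that first groups by the id column and then takes, for each id, the first row where column b exceeds 10, so that the result looks like this: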

 a b id 
1 -3 12 1 
2  4 23 2 

My DataFrame has about 2 million rows and 10,000 id values, so a loop is very slow.


Could some groups have no value `> 10`? – jezrael

Answers

4

First filter with fast boolean indexing, then use groupby + first:

df = df[df['b'] > 10].groupby('id', as_index=False).first() 
print (df) 
    id a b 
0 1 -3 12 
1 2 4 23 
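
As a cross-check, here is a minimal self-contained sketch on the question's sample data. It uses `drop_duplicates` instead of `groupby`, which is an alternative not proposed in this thread; the variable name `first_hits` is made up for illustration. Unlike `groupby(...).first()`, it keeps the original row index and column order.

```python
import pandas as pd

# Sample data from the question (index 1..6)
df = pd.DataFrame(
    {'a': [10, 6, -3, 4, 12, 3],
     'b': [6, -3, 12, 23, 11, -5],
     'id': [1, 1, 1, 2, 2, 2]},
    index=[1, 2, 3, 4, 5, 6],
)

# Keep only rows with b > 10, then keep the first remaining row per id
first_hits = df[df['b'] > 10].drop_duplicates(subset='id', keep='first')
print(first_hits)
```

This should select the rows with original index 3 and 4, matching the result requested in the question.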

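The solution is a bit more complicated if some groups have no value greater than 10: the mask then needs to be extended with duplicated. Note that `~duplicated(keep=False)` is only True for ids that occur exactly once, so this variant keeps single-row groups such as id=3 below: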

print (df) 
    a b id 
1 7 6 3 <- no value b>10 for id=3 
1 10 6 1 
2 6 -3 1 
3 -3 12 1 
4 4 23 2 
5 12 11 2 
6 3 -5 2 

mask = ~df['id'].duplicated(keep=False) | (df['b'] > 10) 
df = df[mask].groupby('id', as_index=False).first() 
print (df) 
    id a b 
0 1 -3 12 
1 2 4 23 
2 3 7 6 
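
The `duplicated(keep=False)` mask only rescues ids that occur exactly once. If an id can have several rows and none of them exceeds 10, a more general mask is needed. The following is a sketch under that assumption, not part of the original answer; the extra `id=3` row and the names `above` and `has_hit` are invented for illustration.

```python
import pandas as pd

# id=3 now has two rows, neither with b > 10 (hypothetical data)
df = pd.DataFrame({
    'a':  [7, 8, 10, 6, -3, 4, 12, 3],
    'b':  [6, 2, 6, -3, 12, 23, 11, -5],
    'id': [3, 3, 1, 1, 1, 2, 2, 2],
})

above = df['b'] > 10
# True on every row of a group that contains at least one b > 10
has_hit = above.groupby(df['id']).transform('any')
# Keep qualifying rows, plus all rows of groups without any qualifying row
mask = above | ~has_hit
result = df[mask].groupby('id', as_index=False).first()
print(result)
```

Groups with a value above 10 return their first such row; groups without one fall back to their first row overall.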

Timings

import numpy as np
import pandas as pd

# [2000000 rows x 3 columns]
np.random.seed(123) 
N = 2000000 
df = pd.DataFrame({'id': np.random.randint(10000, size=N), 
        'a':np.random.randint(10, size=N), 
        'b':np.random.randint(15, size=N)}) 
#print (df) 


In [284]: %timeit (df[df['b'] > 10].groupby('id', as_index=False).first()) 
10 loops, best of 3: 67.6 ms per loop 

In [285]: %timeit (df.query("b > 10").groupby('id').head(1)) 
10 loops, best of 3: 107 ms per loop 

In [286]: %timeit (df[df['b'] > 10].groupby('id').head(1)) 
10 loops, best of 3: 90 ms per loop 

In [287]: %timeit df.query("b > 10").groupby('id', as_index=False).first() 
10 loops, best of 3: 83.3 ms per loop 

# a bit faster without sorting the group keys (sort=False)
In [288]: %timeit (df[df['b'] > 10].groupby('id', as_index=False, sort=False).first()) 
10 loops, best of 3: 62.9 ms per loop 

This is nice! :) – MaxU

4
In [146]: df.query("b > 10").groupby('id').head(1) 
Out[146]: 
    a b id 
3 -3 12 1 
4 4 23 2 
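
One difference between `head(1)` and `first()` is worth knowing when the data can contain missing values: `first()` returns the first non-null entry of each column within a group, while `head(1)` returns the first row unchanged and keeps its original index. The sketch below illustrates this on made-up data with a NaN in column `a`; it is not taken from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first qualifying row of id=1 has a missing 'a'
df = pd.DataFrame({'a': [np.nan, 5.0, 4.0],
                   'b': [12, 15, 23],
                   'id': [1, 1, 2]})

hits = df[df['b'] > 10]

# first(): per-column first non-null value, so 'a' comes from the second row
by_first = hits.groupby('id', as_index=False).first()

# head(1): the actual first row per group, NaN included, original index kept
by_head = hits.groupby('id').head(1)

print(by_first)
print(by_head)
```

If the first row must be returned exactly as stored, `head(1)` is the safer choice in this situation.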
1

If the last column (id) is already sorted, here is a NumPy solution that uses np.searchsorted:

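def numpy_searchsorted(df, thresh=10): 
    # Assumes the rows are already sorted by 'id'.
    # Columns are selected by name so the result does not depend on column order.
    a = df.values 
    m = df['b'].values > thresh              # rows where b exceeds the threshold
    mask_idx = np.flatnonzero(m) 

    b = df['id'].values[mask_idx]            # ids of the qualifying rows (still sorted)
    unq_ids = b[np.concatenate(([True], b[1:] != b[:-1]))] 
    idx = np.searchsorted(b, unq_ids)        # first position of each id
    out = a[mask_idx[idx]] 
    return pd.DataFrame(out, columns = df.columns) 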

Runtime test:

In [2]: np.random.seed(123) 
    ...: N = 2000000 
    ...: df = pd.DataFrame({'id': np.sort(np.random.randint(10000, size=N)), 
    ...:     'a':np.random.randint(10, size=N), 
    ...:     'b':np.random.randint(15, size=N)}) 
    ...: 

# @MaxU's soln 
In [3]: %timeit df.query("b > 10").groupby('id').head(1) 
10 loops, best of 3: 44.8 ms per loop 

# @jezrael's best soln that assumes last col as sorted too 
In [4]: %timeit (df[df['b'] > 10].groupby('id', as_index=False, sort=False).first()) 
10 loops, best of 3: 30.1 ms per loop 

# Proposed in this post 
In [5]: %timeit numpy_searchsorted(df) 
100 loops, best of 3: 12.6 ms per loop 
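
Before trusting the timing, it helps to confirm that the NumPy approach returns the same rows as the pandas one. The following sketch does that on a smaller frame sorted by `id`. The function is restated so the block runs on its own, with columns selected by name; the frame `small`, the seed, and the sizes are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

def numpy_searchsorted(df, thresh=10):
    # Assumes df is sorted by 'id'
    a = df.values
    mask_idx = np.flatnonzero(df['b'].values > thresh)
    ids = df['id'].values[mask_idx]
    # Mark the first occurrence of each id among the qualifying rows
    unq_ids = ids[np.concatenate(([True], ids[1:] != ids[:-1]))]
    idx = np.searchsorted(ids, unq_ids)
    return pd.DataFrame(a[mask_idx[idx]], columns=df.columns)

rng = np.random.RandomState(0)
N = 1000
small = pd.DataFrame({'a': rng.randint(10, size=N),
                      'b': rng.randint(15, size=N),
                      'id': np.sort(rng.randint(50, size=N))})

expected = small[small['b'] > 10].groupby('id').head(1).reset_index(drop=True)
result = numpy_searchsorted(small)

# Both approaches should pick exactly the same rows, in the same order
same = np.array_equal(result.values, expected.values)
print(same)
```

Note that the function raises an error if no row exceeds the threshold, so a guard for an empty `mask_idx` would be needed in that case.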