pandas: efficiently multiplying rows based on a condition

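I want to multiply (repeat) rows of a DataFrame according to the value in a condition column.

For example, when the value in the condition column is 2, I want to replace that row with two identical rows and set condition to 1 in each new row.

Example DataFrame: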

import numpy as np
import pandas as pd

df = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                   'condition': [1, 1, 3, 2],
                   's': ['a', 'b', 'c', 'd']})


condition   k  s
        1  K0  a
        1  K1  b
        3  K1  c
        2  K2  d

Desired result:

condition   k  s
        1  K0  a
        1  K1  b
        1  K1  c
        1  K1  c
        1  K1  c
        1  K2  d
        1  K2  d

Can this operation be done efficiently in place, without creating a temporary df?

Source

2016-04-21 Toren

A faster way is to use loc with np.repeat:

df = df.loc[np.repeat(df.index.values, df.condition)].reset_index(drop=True)
df['condition'] = 1
print(df)
   condition   k  s
0          1  K0  a
1          1  K1  b
2          1  K1  c
3          1  K1  c
4          1  K1  c
5          1  K2  d
6          1  K2  d
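
For readers who want to run the loc / np.repeat approach end to end, here is a minimal self-contained sketch under Python 3. The names `repeated_labels` and `result` are illustrative and not from the original answer. The result is a new DataFrame, so this is not a true in-place modification of the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                   'condition': [1, 1, 3, 2],
                   's': ['a', 'b', 'c', 'd']})

# Repeat each index label as many times as its 'condition' value says.
repeated_labels = np.repeat(df.index.values, df['condition'].values)

# Select rows by the repeated labels, then renumber the index from 0.
result = df.loc[repeated_labels].reset_index(drop=True)

# Every expanded row now stands for a single unit.
result['condition'] = 1
print(result)
```

The original `df` is left unchanged here; assigning the result back to `df`, as the answer does, simply rebinds the name.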

Another solution uses groupby with concat and finally sets the condition column to 1, but it is slower:

df = (df.groupby('condition', as_index=False, sort=False)
        .apply(lambda x: pd.concat([x] * x.condition.values[0], ignore_index=True))
        .reset_index(drop=True))
df['condition'] = 1
print(df)
   condition   k  s
0          1  K0  a
1          1  K1  b
2          1  K1  c
3          1  K1  c
4          1  K1  c
5          1  K2  d
6          1  K2  d
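
One caveat worth sketching: the loc-based solution selects rows by label, so it assumes the index labels are unique. If the index contains duplicates, label selection can return more rows than intended. The variant below is an assumption of mine, not part of the original answer. It repeats row positions with iloc instead. The helper name `repeat_rows` and the frame `df_dup` are hypothetical.

```python
import numpy as np
import pandas as pd


def repeat_rows(frame, count_col):
    """Return a new frame with each row repeated frame[count_col] times."""
    # Repeat integer positions 0..n-1 rather than index labels,
    # so duplicate labels in the index cannot cause extra matches.
    positions = np.repeat(np.arange(len(frame)), frame[count_col].to_numpy())
    out = frame.iloc[positions].reset_index(drop=True)
    out[count_col] = 1
    return out


# Hypothetical frame whose index labels are not unique.
df_dup = pd.DataFrame({'k': ['K0', 'K1'],
                       'condition': [2, 1],
                       's': ['a', 'b']},
                      index=['x', 'x'])

expanded = repeat_rows(df_dup, 'condition')
print(expanded)
```

With a default RangeIndex, as in the question, this should give the same rows as the loc version.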

Timings:

In [917]: %timeit df.loc[np.repeat(df.index.values,df.condition)].reset_index(drop=True) 
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.04 ms per loop 

In [918]: %timeit df.groupby('condition', as_index=False, sort=False).apply(lambda x: pd.concat([x]*x.condition.values[0], ignore_index=True)).reset_index(drop=True) 
100 loops, best of 3: 7.78 ms per loop

Source

2016-04-21 07:55:19 jezrael

Thanks @jezrael, I like your solution. I agree with you about the second variant: `groupby` looks slower – Toren
