2017-06-06

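Drop duplicates and append values in pandas

I have the DataFrame below. I want to drop the duplicates, but append the values in column E from the duplicate rows to the record that is kept.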

import pandas as pd 
import numpy as np 
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,6,7],
        'B' : [1,1,3,5,0,0,np.nan,9,0,0],
        'C' : ['AA1233445','AA1233445', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'],
        'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
        'E' : ['Assign','Allign','Hello','Ugly','Appreciate','Undo','Testing','Unicycle','Pharma','Unicorn',]})
print(dfp) 

I grab all the duplicates:

df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy() 

     A    B          C           D           E
0  NaN  1.0  AA1233445    123456.0      Assign
1  NaN  1.0  AA1233445    123456.0      Allign
2  3.0  3.0      rmacy   1234567.0       Hello
4  5.0  0.0   Ab123455     12345.0  Appreciate
5  5.0  0.0   TV192837     12345.0        Undo
6  3.0  NaN         RX  12345678.0     Testing

and I want my end result to be:

     A    B          C          D                E
0  NaN  1.0  AA1233445   123456.0    Assign Allign
2  3.0  3.0      rmacy  1234567.0    Hello Testing
4  5.0  0.0   Ab123455    12345.0  Appreciate Undo

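I know I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to grab the first occurrence, but I can't work out how to set the value of column E so that it includes the values from the other duplicates.

I think I need to try something like: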

df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() 
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E'] 

but my output is:

     A    B          C          D                     E
0  NaN  1.0  AA1233445   123456.0          AssignAssign
2  3.0  3.0      rmacy  1234567.0            HelloHello
4  5.0  0.0   Ab123455    12345.0  AppreciateAppreciate

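I'm stumped. Am I overcomplicating this? How do I get the output I'm looking for, so that I can later drop every duplicate except the first while keeping the "saved" values in column E?

Answers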

3

Define a mapping of functions to pass to agg, and use it inside a groupby. To get groupby to work with NaN, I convert the key to string and then back to float.

f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']} 

dfp.groupby(
    dfp.A.astype(str), sort=False 
).agg(f).reset_index().eval(
    'A = @pd.to_numeric(A, "coerce").values', 
    inplace=False 
) 

     A    B           C            D                E
0  NaN  1.0   AA1233445     123456.0    Assign Allign
1  3.0  3.0       rmacy    1234567.0    Hello Testing
2  4.0  5.0    Idaho Rx   12345678.0             Ugly
3  5.0  0.0    Ab123455      12345.0  Appreciate Undo
4  1.0  9.0  Ohio Drugs  123456789.0         Unicycle
5  6.0  0.0     RX12345    1234567.0           Pharma
6  7.0  0.0  USA Pharma          NaN          Unicorn
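
The string round trip is needed because groupby drops NaN keys by default. As a hedged alternative that is not part of the original answer, and assuming pandas 1.1 or later (where groupby accepts `dropna=False`), the NaN keys can be kept as their own group directly. The sketch below reuses the same `f` mapping; the names `default_groups` and `with_nan` are only illustrative.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678,
          123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# Join the strings in E; take the first non-null value of every other column.
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}

# Default behaviour: rows whose key is NaN are silently left out.
default_groups = dfp.groupby('A').agg(f)

# Assumption: pandas >= 1.1, where dropna=False keeps NaN as a group key.
with_nan = dfp.groupby('A', sort=False, dropna=False).agg(f).reset_index()
print(with_nan)
```

With `dropna=False`, column A stays numeric throughout, so no conversion back to float is required.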

To restrict it to just the duplicated rows:

f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']} 
d1 = dfp[dfp.duplicated('A', keep=False)] 
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index() 
d2.A = d2.A.astype(float) 

d2

     A    B          C          D                E
0  NaN  1.0  AA1233445   123456.0    Assign Allign
1  3.0  3.0      rmacy  1234567.0    Hello Testing
2  5.0  0.0   Ab123455    12345.0  Appreciate Undo
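
The result above is renumbered 0, 1, 2, whereas the output asked for in the question keeps the original row labels 0, 2, 4. One possible way to preserve them, sketched here as an assumption rather than as the answerer's method, is to broadcast the joined string back onto every duplicate row with transform and then drop all but the first row per key. The names `dup`, `key`, `joined`, and `result` are hypothetical.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678,
          123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# Keep only rows whose A value occurs more than once (NaN counts as equal).
dup = dfp[dfp['A'].duplicated(keep=False)]

# Use the string form of A as the key so the NaN rows share the group 'nan'.
key = dup['A'].astype(str)

# transform returns one value per input row, aligned on the original index.
joined = dup['E'].groupby(key, sort=False).transform(' '.join)

# Overwrite E with the joined text, then keep the first row of each A value.
result = dup.assign(E=joined).drop_duplicates('A')
print(result)
```

Because only E is replaced, columns B, C, and D keep the values of the first row in each group, including any NaN that row may hold.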

Oh my... maybe I wasn't overcomplicating things after all. Ha, this looks intense. I'll have to dig into it! Thanks, as always! – MattR


What an elegant solution. –


@ScottBoston Thank you – piRSquared

3

Here is my ugly solution:

In [263]: (dfp.reset_index() 
    ...:  .assign(A=dfp.A.fillna(-1)) 
    ...:  .groupby('A') 
    ...:  .filter(lambda x: len(x) > 1) 
    ...:  .groupby('A', as_index=False) 
    ...:  .apply(lambda x: x.head(1).assign(E=x.E.str.cat(sep=' '))) 
    ...:  .replace({'A':{-1:np.nan}}) 
    ...:  .set_index('index')) 
    ...: 
Out[263]: 
         A    B          C          D                E
index
0      NaN  1.0  AA1233445   123456.0    Assign Allign
2      3.0  3.0      rmacy  1234567.0    Hello Testing
4      5.0  0.0   Ab123455    12345.0  Appreciate Undo
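
The same sentinel idea can also be written without groupby.apply. The following is a minimal sketch, not the answerer's code, and it assumes that -1 never occurs as a real value in column A; `SENTINEL`, `tmp`, `agg_spec`, and `res` are hypothetical names. Note that 'first' picks the first non-null value per column, which can differ from `head(1)` when the first row of a group contains NaN.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678,
          123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

SENTINEL = -1  # assumption: -1 is never a genuine value in column A

# Expose the row labels as a column and replace NaN keys with the sentinel.
tmp = dfp.reset_index().assign(A=lambda d: d['A'].fillna(SENTINEL))

# Keep only the rows whose key appears more than once.
tmp = tmp[tmp['A'].duplicated(keep=False)]

# First non-null value for every column except E, which is joined.
agg_spec = {c: 'first' for c in ['index', 'B', 'C', 'D']}
agg_spec['E'] = ' '.join

res = (tmp.groupby('A', sort=False)
          .agg(agg_spec)
          .reset_index()
          .replace({'A': {SENTINEL: np.nan}})  # restore the NaN keys
          .set_index('index'))                 # restore the row labels
print(res)
```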

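As I mentioned on the other answer, I guess this isn't a simple task. I appreciate this, and I'll study it closely. I'll have to "reverse engineer" it and follow your thought process :) – MattR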


Wow... now that is impressive. –


@ScottBoston, I don't like this solution, but thanks! :) – MaxU