2017-07-24 197 views

Python Pandas: merging DataFrames on multiple conditions

I want to merge two DataFrames, obtained via SQL, on multiple conditions.

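  • df1: the first DataFrame contains Customer ID, Cluster ID and Customer Zone ID.
  • df2: the second DataFrame contains Complain ID, RegistrationNumber and Status.

df1 and df2 look like this: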

DF1

Customer ID  Cluster ID Customer Zone ID 
CUS1001.A  CUS1001.X CUS1000 
CUS1001.B  CUS1001.X CUS1000 
CUS1001.C  CUS1001.X CUS1000 
CUS1001.D  CUS1001.X CUS1000 
CUS1001.E  CUS1001.X CUS1000 
CUS2001.A  CUS2001.X CUS2000 

DF2:

Complain ID RegistrationNumber Status 
CUS3501.A  99231   open 
CUS1001.B  21340   open 
CUS1001.X  0     open 
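
For anyone wanting to reproduce this without the SQL source, the two sample tables can be built by hand. This is only a sketch under assumptions: the column names are copied from the tables above, and the third complaint's RegistrationNumber is taken to be 0, as the expected result further down suggests.

```python
import pandas as pd

# Sample data copied from the tables above (hand-built stand-in for the SQL query).
df1 = pd.DataFrame({
    "Customer ID": ["CUS1001.A", "CUS1001.B", "CUS1001.C",
                    "CUS1001.D", "CUS1001.E", "CUS2001.A"],
    "Cluster ID": ["CUS1001.X"] * 5 + ["CUS2001.X"],
    "Customer Zone ID": ["CUS1000"] * 5 + ["CUS2000"],
})

df2 = pd.DataFrame({
    "Complain ID": ["CUS3501.A", "CUS1001.B", "CUS1001.X"],
    # Assumption: the third complaint's registration number is 0.
    "RegistrationNumber": [99231, 21340, 0],
    "Status": ["open", "open", "open"],
})

print(df1.shape, df2.shape)
```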

I want to merge these two DataFrames using the following conditions:

if (Complain ID == Customer ID):
    Merge on Customer ID
elif (Complain ID == Cluster ID):
    Merge on Customer ID
elif (Complain ID == Customer Zone ID):
    Merge on Customer ID
else:
    Merge empty row.

The final result should look like this:

Customer ID  Cluster ID  Customer Zone ID  Complain ID  Regi ID  Status
CUS1001.A    CUS1001.X   CUS1000           CUS1001.X    0        open
CUS1001.B    CUS1001.X   CUS1000           CUS1001.B    21340    open
CUS1001.C    CUS1001.X   CUS1000           CUS1001.X    0        open
    .            .           .                 .        .         .
    .            .           .                 .        .         .
CUS2001.A    CUS2001.X   CUS2000           0            0        0

Please help!

Answers

Try this, using pandas melt, merge and concat:

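import pandas as pd

# Long form: one (variable, value) row per ID; 'number' is the source row in df1.
df = pd.melt(df1)
df = df.merge(df2, left_on='value', right_on='Complain ID', how='left')
df['number'] = df.groupby('variable').cumcount()
# Per df1 row, back-fill so the Customer ID row carries the first match found.
cols = ['Complain ID', 'RegistrationNumber', 'Status']
df[cols] = df.groupby('number')[cols].bfill()
Target = pd.concat([df1, df.iloc[:len(df1)][cols]], axis=1).fillna(0)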

Target 
Out[39]: 
  Customer ID Cluster ID Customer Zone ID Complain ID  RegistrationNumber Status
0   CUS1001.A  CUS1001.X          CUS1000   CUS1001.X                 0.0   open
1   CUS1001.B  CUS1001.X          CUS1000   CUS1001.B             21340.0   open
2   CUS1001.C  CUS1001.X          CUS1000   CUS1001.X                 0.0   open
3   CUS1001.D  CUS1001.X          CUS1000   CUS1001.X                 0.0   open
4   CUS1001.E  CUS1001.X          CUS1000   CUS1001.X                 0.0   open
5   CUS2001.A  CUS2001.X          CUS2000           0                 0.0      0
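
The key step in the snippet above is the reshape: melt stacks the three ID columns into one long value column, and cumcount numbers the rows within each original column, so number identifies the df1 row each value came from. A small sketch on a two-row toy frame (hypothetical data, not the asker's) shows the intermediate shape:

```python
import pandas as pd

# Toy frame (hypothetical data) with the same three ID columns as df1.
toy = pd.DataFrame({
    "Customer ID": ["A1", "B1"],
    "Cluster ID": ["AX", "BX"],
    "Customer Zone ID": ["A0", "B0"],
})

# melt: one row per (column name, cell value) pair, columns stacked in order.
long = pd.melt(toy)

# cumcount within each original column recovers the source row position.
long["number"] = long.groupby("variable").cumcount()

print(long)
```

Grouping by number therefore collects the Customer ID, Cluster ID and Customer Zone ID of one df1 row, in that order, which is why a back-fill within the group pulls the first available match up to the Customer ID row.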

Update: using numpy's intersect1d. Personally, I like this approach better than the previous one.

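import numpy as np
# Intersect each row's (CustomerID, ClusterID) pair with the complaint IDs.
matches = [np.intersect1d(x, df2.ComplainID.values) for x in df1[['CustomerID','ClusterID']].values]
# Keep the first hit (intersect1d returns sorted values), or NaN when there is none.
df1['MatchId'] = [m[0] if len(m) else np.nan for m in matches]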
df1 
Out[307]: 
    CustomerID ClusterID CustomerZoneID MatchId 
0 CUS1001.A CUS1001.X  CUS1000 CUS1001.X 
1 CUS1001.B CUS1001.X  CUS1000 CUS1001.B 
2 CUS1001.C CUS1001.X  CUS1000 CUS1001.X 
3 CUS1001.D CUS1001.X  CUS1000 CUS1001.X 
4 CUS1001.E CUS1001.X  CUS1000 CUS1001.X 
5 CUS2001.A CUS2001.X  CUS2000  NaN 

df1.merge(df2,left_on='MatchId',right_on='ComplainID',how='left') 
Out[311]: 
  CustomerID  ClusterID CustomerZoneID    MatchId ComplainID  RegistrationNumber Status
0  CUS1001.A  CUS1001.X        CUS1000  CUS1001.X  CUS1001.X                 0.0   open
1  CUS1001.B  CUS1001.X        CUS1000  CUS1001.B  CUS1001.B             21340.0   open
2  CUS1001.C  CUS1001.X        CUS1000  CUS1001.X  CUS1001.X                 0.0   open
3  CUS1001.D  CUS1001.X        CUS1000  CUS1001.X  CUS1001.X                 0.0   open
4  CUS1001.E  CUS1001.X        CUS1000  CUS1001.X  CUS1001.X                 0.0   open
5  CUS2001.A  CUS2001.X        CUS2000        NaN        NaN                 NaN    NaN
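
One detail worth knowing about this approach: np.intersect1d returns the sorted unique values common to both inputs. When a row matches on both its customer ID and its cluster ID, the value that comes first is the alphabetically smaller one, not necessarily the customer ID. With the sample IDs that happens to be the customer ID. A minimal sketch, using the sample values from the question:

```python
import numpy as np

complain_ids = np.array(["CUS3501.A", "CUS1001.B", "CUS1001.X"])

# Row 1 of df1: both the customer ID and the cluster ID appear in df2.
row = np.array(["CUS1001.B", "CUS1001.X"])
both = np.intersect1d(row, complain_ids)
print(both)  # sorted order: the customer ID leads only because "B" < "X"

# Row 5 of df1: neither ID appears in df2, so the intersection is empty.
none = np.intersect1d(np.array(["CUS2001.A", "CUS2001.X"]), complain_ids)
print(len(none))
```

If the priority has to follow column order regardless of how the IDs sort, the matching step would need an explicit ordered lookup instead.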

Is there any alternative way to do the same thing? –

@ShawnNash You can 'def' your own function for this particular 'merge' process – Wen
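
A minimal sketch of what such a function might look like, assuming the column names from the question and that Complain ID values are unique and non-null in df2. The name merge_first_match and its parameters are hypothetical, not something from this thread. It tries Customer ID, then Cluster ID, then Customer Zone ID, keeps the first ID that has a complaint, and left-merges on that key:

```python
import pandas as pd

def merge_first_match(df1, df2, id_cols, key="Complain ID"):
    """Left-merge df2 onto df1 via the first column in id_cols whose value is a df2 key."""
    known = df2[key].tolist()
    match = None
    for col in id_cols:
        # Keep the ID only where it has a complaint; NaN elsewhere.
        hit = df1[col].where(df1[col].isin(known))
        # Earlier columns win; later columns only fill the remaining gaps.
        match = hit if match is None else match.combine_first(hit)
    out = df1.copy()
    out[key] = match
    # Rows whose key is NaN find no partner and keep NaN in the df2 columns.
    return out.merge(df2, on=key, how="left")

# Sample data from the question (registration number 0 is an assumption).
df1 = pd.DataFrame({
    "Customer ID": ["CUS1001.A", "CUS1001.B", "CUS1001.C",
                    "CUS1001.D", "CUS1001.E", "CUS2001.A"],
    "Cluster ID": ["CUS1001.X"] * 5 + ["CUS2001.X"],
    "Customer Zone ID": ["CUS1000"] * 5 + ["CUS2000"],
})
df2 = pd.DataFrame({
    "Complain ID": ["CUS3501.A", "CUS1001.B", "CUS1001.X"],
    "RegistrationNumber": [99231, 21340, 0],
    "Status": ["open", "open", "open"],
})

result = merge_first_match(
    df1, df2, ["Customer ID", "Cluster ID", "Customer Zone ID"]
).fillna(0)
print(result)
```

The order of id_cols is the priority, so a different precedence only needs a different list. The trailing fillna(0) mirrors the zeros in the asker's expected result and can be dropped if NaN is preferred.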

@ShawnNash If you are still interested, you can check my update – Wen