2016-01-14 48 views
3

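Left outer join of pandas DataFrames using "contains"

I have two pandas DataFrames, df1 and df2. I want to join df1 with df2 using a left outer join, but with a "contains" condition instead of equality: df1.Full_Key must contain df2.Partial_key.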

Select df1.data_id1, df1.Full_Key, df1.text_field 
, df2.data_id2, df2.text_field 
from df1 
LEFT OUTER JOIN df2 on "df1.Full_Key contains df2.Partial_key" 

Is there a way to do this without a for loop? Given:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546', 'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
})

df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', 'dm_0004'],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-', 'BBBB-CCCC-21050101-', 'XXXX-XXXX-20150101-'],
})

Expected DataFrame after the join:

df_exp_res = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546', 'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', np.nan, np.nan],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-', 'BBBB-CCCC-21050101-', np.nan, np.nan],
})

My solution, using loops:

columns = ['data_id1', 'Full_Key_1', 'text_field', 'Partial_key', 'data_id2']
rows = []
for _, row1 in df1.iterrows():
    found = False
    for _, row2 in df2.iterrows():
        # substring test: the partial key must appear inside the full key
        if row2['Partial_key'].strip() in row1['Full_Key_1'].strip():
            rows.append([row1['data_id1'], row1['Full_Key_1'], row1['text_field'],
                         row2['Partial_key'], row2['data_id2']])
            found = True
    if not found:
        # no match: keep the df1 row and pad the df2 columns with NaN
        rows.append([row1['data_id1'], row1['Full_Key_1'], row1['text_field'],
                     np.nan, np.nan])

pd_result_calc = pd.DataFrame(rows, columns=columns)
print(df1)
print(df2)
print(pd_result_calc)
+0

Are the 'Partial_key's always truncations of the 'Full_Key's? Are they always 19 characters? – unutbu

+0

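Partial_Keys and Full_Keys do not have a fixed length. A row counts as a match as long as the whole Partial_Key is contained in the Full_Key. But yes, a Partial_Key will always be a truncation of a Full_Key, of the form: Partial_key = Full_Key[0:some_n] where 0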

Answers

0

Based on a cross join; see "cartesian product in pandas".

import pandas as pd

df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546', 'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
})

df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', 'dm_0004'],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-', 'BBBB-CCCC-21050101-', 'XXXX-XXXX-20150101-'],
})

df1['key'] =1 
df2['key'] =1 

merged_cross_join = pd.merge(df1, df2,on='key') 

# we don't need this helper column 'key' any longer 
merged_cross_join.drop('key', axis=1, inplace=True) 
df1.drop('key', axis=1, inplace=True) 

contains_criteria = merged_cross_join[['Full_Key_1','Partial_key']].apply(lambda x: x['Partial_key'] in x['Full_Key_1'],axis=1) 
print(merged_cross_join[contains_criteria])

This will produce:

   data_id1                   Full_Key_1 text_field data_id2          Partial_key
0  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0001  AAAA-BBBB-20150101-
1  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0002  AAAA-BBBB-20150101-
6  bzx_0002   BBBB-CCCC-21050101-MS18546      bbbbb  dm_0003  BBBB-CCCC-21050101-
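
As a side note, on pandas 1.2 or later the helper 'key' column should not be needed, because merge accepts how='cross'. The following is a minimal sketch under that version assumption. The frames small_df1 and small_df2 and the names cross, is_match and matched_pairs are illustrative stand-ins, not part of the original post.

```python
import pandas as pd

# Stand-in data in the same shape as df1 and df2 above (illustrative only).
small_df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890',
                   'BBBB-CCCC-21050101-MS18546',
                   'CCCC-CCCC-20150101-MS34567'],
})
small_df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003'],
    'Partial_key': ['AAAA-BBBB-20150101-',
                    'AAAA-BBBB-20150101-',
                    'BBBB-CCCC-21050101-'],
})

# how='cross' (assumes pandas >= 1.2) builds the cartesian product directly.
cross = small_df1.merge(small_df2, how='cross')

# Pairwise substring test; a comprehension avoids the overhead of apply(axis=1).
is_match = pd.Series(
    [p in f for p, f in zip(cross['Partial_key'], cross['Full_Key_1'])],
    index=cross.index,
    dtype=bool,
)
matched_pairs = cross[is_match]
print(matched_pairs)
```

The cross join still materialises len(df1) * len(df2) rows, so this only suits frames small enough for that product to fit in memory.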

Then, because you want something like a left outer join, we don't want to lose any rows from df1:

matched = merged_cross_join[contains_criteria]
# df1 rows for which no Partial_key matched
not_matched_in_df1 = set(df1['data_id1']) - set(matched['data_id1'])
final = pd.concat([matched, df1[df1['data_id1'].isin(not_matched_in_df1)]], axis=0, ignore_index=True)

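Or alternatively (note that combine_first aligns rows on the index, so it is only reliable when the index labels of the matched rows and of df1 happen to line up, as they do for this data; the concat approach is the safer choice):

merged_cross_join[contains_criteria].combine_first(df1)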

This produces:

   data_id1                   Full_Key_1 text_field data_id2          Partial_key
0  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0001  AAAA-BBBB-20150101-
1  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0002  AAAA-BBBB-20150101-
2  bzx_0002   BBBB-CCCC-21050101-MS18546      bbbbb  dm_0003  BBBB-CCCC-21050101-
3  bzx_0003   CCCC-CCCC-20150101-MS34567     cccccc      NaN                  NaN
4  bzx_0004   CCCC-CCCC-20150101-MS34568      ddddd      NaN                  NaN
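
The cross join, the filter and the re-attachment of unmatched rows can also be folded into one helper that ends with an ordinary how='left' merge. This is only a sketch and not the answerer's code: the function name left_join_contains and its parameters are made up for illustration, and it assumes that id_col uniquely identifies the rows of the left frame and that the two frames share no column names.

```python
import pandas as pd


def left_join_contains(left, right, full_col, partial_col, id_col):
    """Left outer join where left[full_col] contains right[partial_col].

    Hypothetical helper: assumes id_col is unique in `left` and that the
    two frames have no column names in common.
    """
    # Cartesian product via a temporary constant key.
    pairs = pd.merge(
        left[[id_col, full_col]].assign(_key=1),
        right.assign(_key=1),
        on='_key',
    ).drop(columns='_key')
    # Keep only the pairs that satisfy the substring condition.
    mask = pd.Series(
        [p in f for p, f in zip(pairs[partial_col], pairs[full_col])],
        index=pairs.index,
        dtype=bool,
    )
    matched = pairs[mask].drop(columns=full_col)
    # A regular left merge restores the unmatched left rows, padded with NaN.
    return left.merge(matched, on=id_col, how='left')


# Same data as in the question.
df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546',
                   'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
})
df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', 'dm_0004'],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-',
                    'BBBB-CCCC-21050101-', 'XXXX-XXXX-20150101-'],
})

result = left_join_contains(df1, df2, 'Full_Key_1', 'Partial_key', 'data_id1')
print(result)
```

Because the final step is a plain left merge on data_id1, the df1 rows keep their original order and the result does not depend on how the row indices happen to align. The intermediate cross join still has len(df1) * len(df2) rows, so for large inputs it may need to be processed in chunks.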
+0

This drops the left outer join part: the df1.Full_Keys for which no match is found. –

+0

See the update now. – Dickster