2015-04-29 72 views
2

Python pandas: efficiently compare rows of a DataFrame?

I have a DataFrame, dfm:

match    group 
adamant   86 
adamant   86 
adamant bild  86 
360works   94 
360works   94 

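Where the 'group' column is the same, I want to compare the contents of the 'match' column two at a time and add the comparison result in another column, 'result'. For example, the expected result is: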

group  compare        result 
    86  adamant, adamant       same 
    86  adamant, adamant bild     not same 
    86  adamant, adamant bild     not same 
    94  360works,360works       same 

Can anyone help?

+1

Could you clean up your expected result? I don't think the formatting came out as you intended. Either way, it seems a bit confusing. – afinit

+0

@benine Sorry! I have edited the text. – UserYmY

+0

Do you want to select every possible pair within each group? –

Answers

1

A bit hacky, but it seems to work for me:

import pandas as pd

# initialize the list to store the dictionaries
# that will create the new DataFrame
new_df_dicts = []

# group on 'group'
for group, indices in dfm.groupby('group').groups.items():
    # get the values in the 'match' column
    vals = dfm.loc[indices, 'match'].values
    # choose every possible pair from the array of column values
    for i in range(len(vals)):
        for j in range(i + 1, len(vals)):
            # compute the new values
            compare = vals[i] + ', ' + vals[j]
            if vals[i] == vals[j]:
                result = 'same'
            else:
                result = 'not same'
            # collect the row; the DataFrame is built once at the end
            new_df_dicts.append({'group': group, 'compare': compare, 'result': result})

# create the new DataFrame
new_df = pd.DataFrame(new_df_dicts)

Here is my output:

    compare group result 
0  360works, 360works  94  same 
1  adamant, adamant  86  same 
2 adamant, adamant bild  86 not same 
3 adamant, adamant bild  86 not same 

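Previously I suggested appending rows to an already-initialized DataFrame. Creating the DataFrame from a list of dictionaries, rather than performing many appends on a DataFrame, runs 9-10 times faster.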

+0

@kellehr Thank you very much. I get this error: TypeError: unsupported operand type(s) for +: 'float' and 'str' – UserYmY

+1

What happens when you try compare = str(vals[i]) + ', ' + str(vals[j])? –
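A TypeError like the one reported is what you would see if the 'match' column contained a missing value, since pandas stores those as float NaN and float + str is not defined. The following is only a sketch under that assumption; the sample values are made up and the real cause in the asker's data is not confirmed by the thread:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first 'match' value is missing, so it is a float NaN.
dfm = pd.DataFrame({
    'match': pd.Series([np.nan, 'adamant'], dtype=object),
    'group': [86, 86],
})
vals = dfm['match'].values

# Concatenating directly fails because the left operand is a float.
error_message = None
try:
    compare = vals[0] + ', ' + vals[1]
except TypeError as exc:
    error_message = str(exc)

# Converting both sides to str avoids the error; NaN becomes the text 'nan'.
compare = str(vals[0]) + ', ' + str(vals[1])
```

If rows with a missing 'match' should not be compared at all, dropping them first with dfm.dropna(subset=['match']) may be a better fit than turning NaN into the string 'nan'.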

+0

Thanks, that works. The problem is that the DataFrame is very large, with 193,000 rows. Can this solution be made more efficient? – UserYmY

-1

Here is another option. I'm not sure whether it is more efficient, though:

import itertools
import pandas as pd

rows = []
for grp in set(dfm['group']):
    for combo in itertools.combinations(dfm[dfm['group'] == grp].index, 2):
        # compute the new values
        match1 = dfm.loc[combo[0], 'match']
        match2 = dfm.loc[combo[1], 'match']
        compare = match1 + ', ' + match2
        if match1 == match2:
            result = 'same'
        else:
            result = 'not same'
        # collect the results (DataFrame.append was removed in pandas 2.0)
        rows.append({'group': grp, 'compare': compare, 'result': result})

new_df = pd.DataFrame(rows)
print(new_df)

(Formatting borrowed from James's answer.)
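On the efficiency question raised in the comments, one possible direction is to replace the Python loops with a self-merge on 'group', so that pandas builds the pairs. This is a sketch rather than a benchmarked solution: the helper column name pos and the suffixes _a/_b are invented here, and the sample data is the small frame from the question.

```python
import pandas as pd

dfm = pd.DataFrame({
    'match': ['adamant', 'adamant', 'adamant bild', '360works', '360works'],
    'group': [86, 86, 86, 94, 94],
})

# Number the rows so each unordered pair can be kept exactly once.
left = dfm.reset_index(drop=True).reset_index().rename(columns={'index': 'pos'})

# A self-join on 'group' pairs every row with every row of the same group.
pairs = left.merge(left, on='group', suffixes=('_a', '_b'))

# Keep only pairs where the first row comes before the second.
pairs = pairs[pairs['pos_a'] < pairs['pos_b']]

# str conversion also guards against non-string values such as NaN.
first = pairs['match_a'].astype(str)
second = pairs['match_b'].astype(str)

new_df = pd.DataFrame({
    'group': pairs['group'],
    'compare': first + ', ' + second,
    'result': (first == second).map({True: 'same', False: 'not same'}),
}).reset_index(drop=True)
```

The self-join materializes every ordered pair within a group before filtering, so memory use grows with the square of the largest group size. With 193,000 rows this should only help if the individual groups are reasonably small.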