2017-05-12
1

Python pandas - comparing tuple values in a DataFrame

I have two columns containing tuples inside a single DataFrame object.

df   a                     b
     ('chicken wing', 1)   ('saucy', 0.35)
     ('burger', 0.85)      ('mason', 0.97)
     ('burping', 0.37)     ('lost in space', 0.47)
     ('marvelous', 1)      ('tremendous', .85)

I need to return the tuple containing the higher number into a new column. It doesn't matter whether the old columns stay in df or not.

Result

df  max_value 

     ('chicken wing', 1) 
     ('mason', 0.97) 
     ('lost in space', 0.47) 
     ('marvelous', 1) 

Answers

1

You can do it like this:

In [1]: df['a'].where(df.apply(lambda row: row['a'][1] > row['b'][1], axis=1), df['b']) 

Out [1]: 

0  (chicken wing, 1) 
1   (mason, 0.97) 
2 (lost in space, 0.47) 
3   (marvelous, 1) 
Name: a, dtype: object 

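Here we use a lambda to compare the tuples in each row and build a boolean mask, then use where to return column 'a' where the mask is True and column 'b' otherwise.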

Output of the apply:

In[3]: 
df.apply(lambda row: row['a'][1] > row['b'][1], axis=1) 

Out[3]: 
0  True 
1 False 
2 False 
3  True 
dtype: bool 
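
For anyone who wants to reproduce the two steps above from scratch, here is a minimal sketch. The DataFrame is rebuilt from the sample data in the question, and the variable names mask and result are just illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "a": [("chicken wing", 1), ("burger", 0.85), ("burping", 0.37), ("marvelous", 1)],
    "b": [("saucy", 0.35), ("mason", 0.97), ("lost in space", 0.47), ("tremendous", 0.85)],
})

# Row-wise boolean mask: True where the number in 'a' is greater than the number in 'b'
mask = df.apply(lambda row: row["a"][1] > row["b"][1], axis=1)

# Keep the tuple from 'a' where the mask is True, otherwise take the tuple from 'b'
result = df["a"].where(mask, df["b"])
print(result.tolist())
```

Because the comparison is a strict >, a row where both numbers are equal would fall through to column 'b'.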

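A more performant approach is to extract the percentages into separate columns, so that the comparison itself can use vectorized methods: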

In[4]: 
df['a_%'] = df['a'].apply(lambda x: x[1]) 
df['b_%'] = df['b'].apply(lambda x: x[1]) 
df 

Out[4]: 
                   a                      b   a_%   b_%
0  (chicken wing, 1)          (saucy, 0.35)  1.00  0.35
1     (burger, 0.85)          (mason, 0.97)  0.85  0.97
2    (burping, 0.37)  (lost in space, 0.47)  0.37  0.47
3     (marvelous, 1)     (tremendous, 0.85)  1.00  0.85

In[5]: 
df['max_value'] = df['a'].where(df['a_%'] > df['b_%'], df['b']) 
df 

Out[5]: 
                   a                      b   a_%   b_%              max_value
0  (chicken wing, 1)          (saucy, 0.35)  1.00  0.35      (chicken wing, 1)
1     (burger, 0.85)          (mason, 0.97)  0.85  0.97          (mason, 0.97)
2    (burping, 0.37)  (lost in space, 0.47)  0.37  0.47  (lost in space, 0.47)
3     (marvelous, 1)     (tremendous, 0.85)  1.00  0.85         (marvelous, 1)
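
If the extra a_% and b_% columns are not wanted in the final frame, the same idea can be sketched with temporary Series instead of new columns. This is only an illustration under that assumption: the names a_score and b_score are invented here, and the DataFrame is rebuilt from the question's sample data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "a": [("chicken wing", 1), ("burger", 0.85), ("burping", 0.37), ("marvelous", 1)],
    "b": [("saucy", 0.35), ("mason", 0.97), ("lost in space", 0.47), ("tremendous", 0.85)],
})

# Pull the numeric second element of each tuple into temporary Series
# (hypothetical helper names; they are never added to df)
a_score = pd.Series([t[1] for t in df["a"]], index=df.index)
b_score = pd.Series([t[1] for t in df["b"]], index=df.index)

# The comparison of the two numeric Series is vectorized
df["max_value"] = df["a"].where(a_score > b_score, df["b"])
print(df["max_value"].tolist())
```

Extracting the numbers still loops over the tuples once per column; only the comparison and the selection run as vectorized operations.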

You can also define a custom function that handles a dynamic number of columns and uses max:

In[11]: 
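def func(x): 
    # x is one row; collect the number from each tuple in it 
    vals = [y[1] for y in x] 
    # .iloc makes the lookup positional rather than label-based 
    return x.iloc[vals.index(max(vals))] 
df[['a', 'b']].apply(func, axis=1) 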

Out[11]: 
0  (chicken wing, 1) 
1   (mason, 0.97) 
2 (lost in space, 0.47) 
3   (marvelous, 1) 
dtype: object 
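
To illustrate the "dynamic number of columns" point, here is a rough sketch with a third tuple column. The column 'c' and its values are made up for this example, and the list comprehension with max(..., key=...) is a swapped-in variant of the function above rather than the original code.

```python
import pandas as pd

# Columns 'a' and 'b' come from the question; 'c' is a hypothetical extra column
df = pd.DataFrame({
    "a": [("chicken wing", 1), ("burger", 0.85), ("burping", 0.37), ("marvelous", 1)],
    "b": [("saucy", 0.35), ("mason", 0.97), ("lost in space", 0.47), ("tremendous", 0.85)],
    "c": [("extra", 0.5), ("extra", 0.99), ("extra", 0.1), ("extra", 0.2)],
})

tuple_cols = ["a", "b", "c"]  # any number of tuple columns can be listed here

# For each row, pick the tuple whose second element is largest;
# on a tie, max() keeps the first one it sees, i.e. the leftmost column
df["max_value"] = [
    max(row, key=lambda t: t[1])
    for row in zip(*(df[col] for col in tuple_cols))
]
print(df["max_value"].tolist())
```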
+0

Clever! I need to learn how to think numpy-style, because I suspect the performance would be better than a plain 'apply'

+1

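You would first need to extract the percentages from the tuples into separate columns; storing non-scalar values in a pandas DataFrame is not performant – EdChum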

+1

See the updated answer for a more performant method, although it involves adding extra columns – EdChum

1

Try this:

def compare_tuples(row): 
    if row['a'][1] >= row['b'][1]: 
        return row['a'] 
    else: 
        return row['b'] 
df['larger'] = df.apply(compare_tuples, axis=1) 
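
One detail worth noting: this version compares with >=, whereas the other answers on this page compare with >, so the two behave differently when both numbers are equal. A small sketch of the difference, using a made-up one-row frame (the names tie, strict and inclusive are purely illustrative):

```python
import pandas as pd

# Hypothetical row where both tuples carry the same number
tie = pd.DataFrame({"a": [("left", 0.5)], "b": [("right", 0.5)]})

# Strict comparison: on a tie the tuple from 'b' is returned
strict = [a if a[1] > b[1] else b for a, b in zip(tie["a"], tie["b"])]

# Inclusive comparison: on a tie the tuple from 'a' is returned
inclusive = [a if a[1] >= b[1] else b for a, b in zip(tie["a"], tie["b"])]

print(strict, inclusive)
```

The sample data in the question has no ties, so either operator gives the same result there.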
1
In [1]: import pandas as pd 

In [2]: df = pd.DataFrame({"a" : [('chicken wing', 1), ('burger', 0.85), ('burping', 0.37), ('marvelous', 1)], "b": [('saucy', 0.35), ('mason', 0.97), ('lost in space', 0.47), ('tremendous', .85)]}) 

In [3]: df['max_value'] = [a_value if (a_value[1] > b_value[1]) else b_value for a_value, b_value in zip(df.a, df.b)] 

In [4]: df 
Out[4]: 
                   a                      b              max_value
0  (chicken wing, 1)          (saucy, 0.35)      (chicken wing, 1)
1     (burger, 0.85)          (mason, 0.97)          (mason, 0.97)
2    (burping, 0.37)  (lost in space, 0.47)  (lost in space, 0.47)
3     (marvelous, 1)     (tremendous, 0.85)         (marvelous, 1)