Using apply with a function of two columns in pandas

Consider the following exampleDF:

name  age  sex
a     21   male
b     13   female
c     56   female
d     12   male
e     45   nan
f     10   female

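I want to create a new column that uses both age and sex: if age is below 15 the value should be 'child', otherwise it should equal sex.

I have tried this: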

exampleDF['newColumn'] = exampleDF[['age','sex']].apply(lambda age,sex: 'child' if age < 15 else sex)

But I get the error: missing 1 required positional argument: 'sex'

Please help me understand what I am doing wrong.

Source

2017-04-18 Harj

This will do the job:

import pandas as pd
exampleDF = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e', 'f'],
                          'age': [21, 13, 56, 12, 45, 10],
                          'sex': ['male', 'female', 'female', 'male', None, 'male']})
# axis=1 passes each row to the lambda as a Series
exampleDF['newColumn'] = exampleDF[['age', 'sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'], axis=1)

Then exampleDF is:

   age name     sex newColumn
0   21    a    male      male
1   13    b  female     child
2   56    c  female    female
3   12    d    male     child
4   45    e    None      None
5   10    f    male     child

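In your code you tried to define lambda age,sex:, but that cannot work, because exampleDF[['age','sex']] is a single DataFrame with two columns, not two separate arguments, and apply passes the function one Series at a time. The solution above fixes this by taking a single argument x and indexing into it. You also need to specify the axis (axis=1), so that x is a row rather than a column.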
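To make the failure concrete, here is a minimal sketch of what apply hands to the function. The names df, kinds, error_message and result are illustrative and not from the original post, and the three-row frame is a cut-down version of the example data.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [21, 13, 56],
    'sex': ['male', 'female', 'female'],
})

# apply() calls the function with ONE argument. With the default axis=0
# that argument is a whole column, delivered as a Series.
kinds = df[['age', 'sex']].apply(lambda col: type(col).__name__)

# A two-argument lambda therefore fails: only one Series is passed in.
try:
    df[['age', 'sex']].apply(lambda age, sex: 'child' if age < 15 else sex)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Fix: accept one argument and use axis=1 so that it is a row.
result = df[['age', 'sex']].apply(
    lambda row: 'child' if row['age'] < 15 else row['sex'], axis=1)

print(kinds.tolist())
print(error_message)
print(result.tolist())
```

The try/except is only there so the sketch runs to completion; in an interactive session the TypeError would simply be raised.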

Source

2017-04-18 03:33:52

This may be a very silly question, but when does 'axis=1' need to be specified? I have used apply with a lambda before without specifying the axis. – Harj

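By default axis=0, so the function is applied to each column in turn (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html), which I assume is what you wanted before. Here, however, you want to combine x['age'] and x['sex'] from the same row, so you need to specify axis=1.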

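Continuing my previous comment, here are two examples. The first is exampleDF['newColumn'] = exampleDF[['age']].apply(lambda x: x ** 2). Here you simply square every element of exampleDF[['age']], which works on a whole column at once, so you do not need to specify axis=1. In the second example, exampleDF['newColumn'] = exampleDF[['age']].apply(lambda x: True if x['age'] > 15 else False, axis=1), you do need axis=1: without it the function receives the whole column, and x['age'] would look for a label 'age' in that column's index instead of reading the age of one row.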
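The two cases from the comments can be sketched side by side. This is an illustration only: df, squared, over_15 and failed_without_axis are names made up for the sketch, and the frame holds just three of the ages.

```python
import pandas as pd

df = pd.DataFrame({'age': [21, 13, 56]})

# axis=0 (default): the lambda receives the whole 'age' column, and
# col ** 2 is an element-wise operation on that column.
squared = df[['age']].apply(lambda col: col ** 2)

# axis=1: the lambda receives one row at a time, so row['age'] is a scalar.
over_15 = df[['age']].apply(lambda row: row['age'] > 15, axis=1)

# Same lambda without axis=1: x is the column, whose index is 0, 1, 2,
# so the label 'age' is not found.
try:
    df[['age']].apply(lambda x: x['age'] > 15)
    failed_without_axis = False
except KeyError:
    failed_without_axis = True

print(squared['age'].tolist())
print(over_15.tolist())
print(failed_without_axis)
```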

I think it is better to use mask: where the boolean mask is True the new column gets the string 'child', otherwise it gets the value from the sex column:

print (exampleDF['age'] < 15)
0    False
1     True
2    False
3     True
4    False
5     True
Name: age, dtype: bool


exampleDF['newColumn'] = exampleDF['sex'].mask(exampleDF['age'] < 15, 'child') 
print (exampleDF) 
  name  age     sex newColumn
0    a   21    male      male
1    b   13  female     child
2    c   56  female    female
3    d   12    male     child
4    e   45     NaN       NaN
5    f   10  female     child
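As a small sketch of how mask relates to its counterpart Series.where (not part of the original answer): mask replaces values where the condition is True, while where keeps values where the condition is True and replaces the rest, so inverting the condition gives the same column. The names df, is_child, via_mask and via_where are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'age': [21, 13, 56, 12, 45, 10],
    'sex': ['male', 'female', 'female', 'male', None, 'female'],
})

is_child = df['age'] < 15   # boolean Series, one entry per row

# mask: replace the value wherever the condition is True
via_mask = df['sex'].mask(is_child, 'child')

# where: keep the value wherever the condition is True, so invert it
via_where = df['sex'].where(~is_child, 'child')

# fillna is used only to make the missing value in row 'e' visible
print(via_mask.fillna('missing').tolist())
print(via_where.fillna('missing').tolist())
```

Row e has a missing sex and an age of 45, so it stays missing in both results, as in the output above.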

The main advantage of the mask solution is that it is faster:

#small 6 rows df 
In [63]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child') 
1000 loops, best of 3: 517 µs per loop 

In [64]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1) 
1000 loops, best of 3: 867 µs per loop

#bigger 6k df 
exampleDF = pd.concat([exampleDF]*1000).reset_index(drop=True) 

In [66]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child') 
The slowest run took 5.41 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 589 µs per loop 

In [67]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1) 
10 loops, best of 3: 104 ms per loop

#bigger 60k df (built from the original 6-row frame) - apply very slow
exampleDF = pd.concat([exampleDF]*10000).reset_index(drop=True) 

In [69]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child') 
1000 loops, best of 3: 1.23 ms per loop 

In [70]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1) 
1 loop, best of 3: 1.03 s per loop
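The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The helper names with_mask and with_apply, the 6,000-row size and the repeat count of 5 are assumptions made for this sketch, and the numbers it prints will vary by machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    'age': [21, 13, 56, 12, 45, 10],
    'sex': ['male', 'female', 'female', 'male', None, 'female'],
})
big = pd.concat([small] * 1000).reset_index(drop=True)   # 6,000 rows


def with_mask(df):
    # vectorised: one comparison and one replacement over the whole column
    return df['sex'].mask(df['age'] < 15, 'child')


def with_apply(df):
    # row-wise: the lambda is called once per row
    return df[['age', 'sex']].apply(
        lambda x: 'child' if x['age'] < 15 else x['sex'], axis=1)


runs = 5
t_mask = timeit.timeit(lambda: with_mask(big), number=runs)
t_apply = timeit.timeit(lambda: with_apply(big), number=runs)
print(f'mask:  {t_mask / runs:.6f} s per call')
print(f'apply: {t_apply / runs:.6f} s per call')
```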

Source

2017-04-18 06:37:22 jezrael
