I think it is better to use mask: where the boolean mask is True, the new column gets the string child; everywhere else it keeps the value from the sex column:
print (exampleDF['age'] < 15)
0    False
1     True
2    False
3     True
4    False
5     True
Name: age, dtype: bool
exampleDF['newColumn'] = exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
print (exampleDF)
  name  age     sex newColumn
0    a   21    male      male
1    b   13  female     child
2    c   56  female    female
3    d   12    male     child
4    e   45     NaN       NaN
5    f   10  female     child
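For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame construction is an assumption rebuilt from the printed output above (the original setup code is not shown), and the variable is_child is just an illustrative name. It also shows Series.where, which does the same job with the condition inverted.

```python
import numpy as np
import pandas as pd

# Assumed sample data, reconstructed from the printed output in the answer.
exampleDF = pd.DataFrame({
    "name": list("abcdef"),
    "age": [21, 13, 56, 12, 45, 10],
    "sex": ["male", "female", "female", "male", np.nan, "female"],
})

# Boolean mask: True for rows where age is below 15.
is_child = exampleDF["age"] < 15

# mask() replaces values where the condition is True and keeps the rest.
exampleDF["newColumn"] = exampleDF["sex"].mask(is_child, "child")

# where() is the mirror image: it keeps values where the condition is True,
# so the condition has to be negated to get the same result.
via_where = exampleDF["sex"].where(~is_child, "child")

print(exampleDF)
```

Note that row 4 stays NaN: its age is 45, so the mask is False there and the missing sex value is carried over unchanged.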
The main advantage of the mask solution is that it is faster:
#small 6 rows df
In [63]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
1000 loops, best of 3: 517 µs per loop
In [64]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
1000 loops, best of 3: 867 µs per loop
#bigger 6k df
exampleDF = pd.concat([exampleDF]*1000).reset_index(drop=True)
In [66]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
The slowest run took 5.41 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 589 µs per loop
In [67]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
10 loops, best of 3: 104 ms per loop
#bigger 60k df (built from the original 6 rows df) - apply very slow
exampleDF = pd.concat([exampleDF]*10000).reset_index(drop=True)
In [69]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
1000 loops, best of 3: 1.23 ms per loop
In [70]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
1 loop, best of 3: 1.03 s per loop
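Outside IPython, where %timeit is not available, the comparison can be sketched with the standard timeit module. This is only an illustration under assumptions: the sample data is rebuilt from the output above, the helper names with_mask and with_apply are made up for this example, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data, reconstructed from the printed output in the answer.
small = pd.DataFrame({
    "name": list("abcdef"),
    "age": [21, 13, 56, 12, 45, 10],
    "sex": ["male", "female", "female", "male", np.nan, "female"],
})
big = pd.concat([small] * 1000, ignore_index=True)  # 6,000 rows


def with_mask(df):
    # Vectorized: one comparison and one replacement over whole columns.
    return df["sex"].mask(df["age"] < 15, "child")


def with_apply(df):
    # Row-wise: the lambda is called once per row in Python.
    return df[["age", "sex"]].apply(
        lambda row: "child" if row["age"] < 15 else row["sex"], axis=1
    )


# Check that both approaches agree before timing them
# (NaN is filled first so the comparison does not trip over NaN != NaN).
same_result = (
    with_mask(big).fillna("missing").tolist()
    == with_apply(big).fillna("missing").tolist()
)
print("same result:", same_result)

runs = 5
mask_s = timeit.timeit(lambda: with_mask(big), number=runs) / runs
apply_s = timeit.timeit(lambda: with_apply(big), number=runs) / runs
print(f"mask:  {mask_s * 1e3:.3f} ms per call")
print(f"apply: {apply_s * 1e3:.3f} ms per call")
```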
This may be a very stupid question, but when does 'axis = 1' need to be specified? I have used apply with a lambda before, but without specifying the axis. – Harj
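By default axis = 0, so the function is applied to each column in turn (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html), which I assume is what you wanted before. Here, however, the function needs both x['age'] and x['sex'] from the same row, so you need to specify axis = 1. –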
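Continuing my previous comment, here are two examples. The first is exampleDF['newColumn'] = exampleDF[['age']].apply(lambda x: x ** 2). Here you are simply squaring every element of exampleDF[['age']], so you do not need to specify axis = 1. In the second example, however, exampleDF['newColumn'] = exampleDF[['age']].apply(lambda x: True if x['age'] > 15 else False, axis=1), you do need axis = 1, because the function looks up x['age'] by label, so x has to be a row rather than an entire column. –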
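To make the two examples from the comments concrete, here is a small sketch showing what the lambda receives in each case. The three-row DataFrame df and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical small frame, just to show what apply passes to the function.
df = pd.DataFrame({"age": [21, 13, 56], "sex": ["male", "female", "female"]})

# axis=0 (the default): the function is called once per column and receives
# the whole column as a Series, so an element-wise operation just works.
squared = df[["age"]].apply(lambda col: col ** 2)

# axis=1: the function is called once per row and receives that row as a
# Series indexed by column labels, so row["age"] is a single value.
is_adult = df[["age"]].apply(lambda row: row["age"] > 15, axis=1)

# Without axis=1 the lambda gets the column, whose index is 0, 1, 2,
# so looking up the label "age" fails with a KeyError.
try:
    df[["age"]].apply(lambda col: col["age"] > 15)
    failed_without_axis = False
except KeyError:
    failed_without_axis = True

# For a condition this simple, a plain vectorized comparison avoids apply.
is_adult_vectorized = df["age"] > 15

print(squared)
print(is_adult)
print("fails without axis=1:", failed_without_axis)
```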