2016-01-05 70 views
1

How do I count the non-NaN values across columns in a pandas DataFrame?

My data looks like this:

              Close  a    b    c    d    e      Time
2015-12-03  2051.25  5    4    3    1    1  05:00:00
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00

I need to count, "horizontally" (row by row), the values in columns ['a'] through ['e'] that are not NaN. So the outcome would be this:

df['Count'] = ..... 
df 

              Close  a    b    c    d    e      Time  Count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1

Thanks

+3

Your desired df is completely different from your starting df: you have extra 'NaN' values from the second row through the last row – EdChum

+0

Thanks, corrected the typo – hernanavella

Answers

3

You can select the columns of interest from your df and call count, passing axis=1:

In [24]: 
df['count'] = df[list('abcde')].count(axis=1) 
df 

Out[24]: 
              Close  a    b    c    d    e      Time  count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1
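
To reproduce the result above, the question's frame can be rebuilt by hand. This is only a minimal sketch: the literal values are copied from the question, while the use of np.nan for the missing cells and of a datetime index are assumptions about how the original data was stored.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; np.nan marks the missing cells.
df = pd.DataFrame(
    {
        "Close": [2051.25, 2088.25, 2081.50, 2058.25, 2042.25],
        "a": [5, 5, 5, 5, 5],
        "b": [4, 4, 4, 4, np.nan],
        "c": [3, 3, 3, np.nan, np.nan],
        "d": [1, 1, np.nan, np.nan, np.nan],
        "e": [1, np.nan, np.nan, np.nan, np.nan],
        "Time": ["05:00:00", "06:00:00", "07:00:00", "08:00:00", "09:00:00"],
    },
    index=pd.to_datetime(
        ["2015-12-03", "2015-12-04", "2015-12-07", "2015-12-08", "2015-12-09"]
    ),
)

# count(axis=1) counts the non-NaN cells in each row of the selected columns.
df["count"] = df[list("abcde")].count(axis=1)
print(df["count"].tolist())  # [5, 4, 3, 2, 1]
```

Selecting the columns first matters here: calling count(axis=1) on the whole frame would also count Close and Time in every row.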

Timings

In [25]: 
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) 
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1) 
%timeit df[list('abcde')].count(axis=1) 

100 loops, best of 3: 3.28 ms per loop 
100 loops, best of 3: 2.76 ms per loop 
100 loops, best of 3: 2.98 ms per loop 

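Unsurprisingly, apply is the slowest. The drop version is marginally faster, but semantically I prefer just passing the list of columns of interest and calling count, for readability.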

Hmm, I keep getting varying timings now:

In [27]: 
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) 
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1) 
%timeit df[list('abcde')].count(axis=1) 
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1) 

100 loops, best of 3: 3.33 ms per loop 
100 loops, best of 3: 2.7 ms per loop 
100 loops, best of 3: 2.7 ms per loop 
100 loops, best of 3: 2.57 ms per loop 
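
Since the numbers shift between runs, it can help to rerun the comparison on your own machine. The following is a rough sketch using the standard-library timeit module instead of the %timeit magic; the frame named bench, its size, and the share of missing cells are all made up for illustration and are not the data timed above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark frame: 1,000 rows, roughly 30% of cells missing.
rng = np.random.default_rng(0)
values = rng.random((1000, 5))
values[values < 0.3] = np.nan
bench = pd.DataFrame(values, columns=list("abcde"))

candidates = {
    "apply": lambda: bench.apply(lambda row: row.notnull().sum(), axis=1),
    "count": lambda: bench.count(axis=1),
    "notnull().sum": lambda: bench.notnull().sum(axis=1),
}

# Best of 3 repeats, 3 calls each; absolute numbers depend on the machine
# and the pandas version, so only the relative ordering is informative.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms per call")
```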

More timings

In [160]: 
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) 
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1) 
%timeit df[list('abcde')].count(axis=1) 
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1) 
%timeit df[list('abcde')].notnull().sum(axis=1) 

1000 loops, best of 3: 1.4 ms per loop 
1000 loops, best of 3: 1.14 ms per loop 
1000 loops, best of 3: 1.11 ms per loop 
1000 loops, best of 3: 1.11 ms per loop 
1000 loops, best of 3: 1.05 ms per loop 

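It appears that testing notnull and summing (as notnull produces a boolean mask) is quicker on this dataset.

On a 50k-row df the last method is slightly quicker: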

In [172]: 
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) 
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1) 
%timeit df[list('abcde')].count(axis=1) 
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1) 
%timeit df[list('abcde')].notnull().sum(axis=1) 

1 loops, best of 3: 5.83 s per loop 
100 loops, best of 3: 6.15 ms per loop 
100 loops, best of 3: 6.49 ms per loop 
100 loops, best of 3: 6.04 ms per loop 
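
The notnull-and-sum variant works because notnull() returns a boolean frame, and summing booleans along axis=1 counts the True cells. A small sketch of that equivalence, on a made-up frame named small rather than the benchmark data:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with a staircase of missing values.
small = pd.DataFrame(
    {"a": [5, 5, 5], "b": [4, 4, np.nan], "c": [3, np.nan, np.nan]}
)

mask = small.notnull()            # boolean frame: True where a value is present
via_sum = mask.sum(axis=1)        # True counts as 1, False as 0
via_count = small.count(axis=1)   # built-in non-NaN count per row

print(via_sum.tolist())           # [3, 2, 1]
print(via_count.equals(via_sum))  # True
```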
+0

Another one you could try is: df[list('abcde')].notnull().sum(axis=1), which is faster than any of the above methods in my testing. – n8yoder

+1

@n8yoder It is marginally faster, will try on a larger dataset – EdChum

1

Either include the list of desired columns, or just drop the two columns you don't want included in the count, and count along axis=1 (see docs):

df['Count'] = df.drop(['Close', 'Time'], axis=1).count(axis=1) 


     Close  a    b    c    d    e      Time  Count
0  2051.25  5    4    3    1    1  05:00:00      5
1  2088.25  5    4    3    1  NaN  06:00:00      4
2  2081.50  5    4    3  NaN  NaN  07:00:00      3
3  2058.25  5    4    3  NaN  NaN  08:00:00      3
4  2042.25  5    4  NaN  NaN  NaN  09:00:00      2
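
Including the wanted columns and dropping the unwanted ones select the same data, so they should give the same counts. A short sketch on a cut-down, hypothetical frame (only columns a to c); the .loc label slice is an extra variant that does not appear in the answers and assumes the wanted columns sit next to each other.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-down frame in the same shape as the question's data.
df = pd.DataFrame(
    {
        "Close": [2051.25, 2088.25, 2081.50],
        "a": [5, 5, 5],
        "b": [4, 4, np.nan],
        "c": [3, np.nan, np.nan],
        "Time": ["05:00:00", "06:00:00", "07:00:00"],
    }
)

# Exclude the columns that should not be counted.
by_drop = df.drop(["Close", "Time"], axis=1).count(axis=1)
# Include only the columns that should be counted.
by_list = df[["a", "b", "c"]].count(axis=1)
# Label-based slice; unlike positional slicing, the end label "c" is included.
by_slice = df.loc[:, "a":"c"].count(axis=1)

print(by_drop.tolist())                                      # [3, 2, 1]
print(by_drop.equals(by_list) and by_list.equals(by_slice))  # True
```

The drop form keeps working if more value columns are added later, while the explicit list states exactly which columns are counted.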
1
df['Count'] = df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) 

In [1254]: df 
Out[1254]: 
              Close  a    b    c    d    e      Time  Count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1