2016-03-17 84 views
5

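Pandas: get the count of a single value in a column of a DataFrame

Using pandas, I would like to get the count of a specific value in a column. I know that using df.somecolumn.ravel() will give me all the unique values and their counts. But how do I get the count of one specific value?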

In[5]:df 
Out[5]: 
     col 
     1 
     1 
     1 
     1 
     2 
     2 
     2 
     1 

Expected:

To get count of 1. 

    In[6]:df.somecalulation(1) 
    Out[6]: 5 

    To get count of 2. 

    In[6]:df.somecalulation(2) 
    Out[6]: 3 

Are you optimizing this for multiple queries, or for a small (or single) query? –


A single small query, then. – Randhawa


See the answer. –

Answers

9

You can try value_counts:

# keep the result in a new name so the original df stays available
counts = df['col'].value_counts().reset_index()
counts.columns = ['col', 'count']
print(counts)
   col  count
0    1      5
1    2      3

Edit:

print((df['col'] == 1).sum())
5

Or:

def somecalulation(x):
    # count the rows of df whose 'col' value equals x
    return (df['col'] == x).sum()

print(somecalulation(1))
5
print(somecalulation(2))
3
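
For anyone who wants to paste and run this, here is a minimal self-contained sketch of the boolean-mask approach. It rebuilds the frame from the question; the helper name `count_value` and its parameters are my own invention for illustration, not a pandas function.

```python
import pandas as pd

# Rebuild the example frame from the question (single column named 'col').
df = pd.DataFrame({'col': [1, 1, 1, 1, 2, 2, 2, 1]})

def count_value(frame, column, value):
    """Hypothetical helper: number of rows where frame[column] == value."""
    # The comparison yields a boolean Series; summing it counts the True entries.
    return int((frame[column] == value).sum())

count_of_1 = count_value(df, 'col', 1)
count_of_2 = count_value(df, 'col', 2)
print(count_of_1, count_of_2)  # 5 3
```

Passing the frame and column in as arguments avoids relying on a global df, which makes the helper easier to reuse on other columns.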

Or:

# compute all counts once, then look individual values up
ser = df['col'].value_counts()

def somecalulation(s, x):
    return s[x]

print(somecalulation(ser, 1))
5
print(somecalulation(ser, 2))
3
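
One caveat with the lookup version: `s[x]` raises a KeyError when the value never occurs in the column. A possible workaround, sketched here under the assumption that a count of 0 is what you want for absent values, is `Series.get` with a default. The name `safe_count` is hypothetical.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({'col': [1, 1, 1, 1, 2, 2, 2, 1]})

# Compute the counts once; the index holds the distinct values.
ser = df['col'].value_counts()

def safe_count(s, x):
    """Hypothetical helper: look up x in a value_counts result, 0 if absent."""
    return int(s.get(x, 0))

present = safe_count(ser, 1)
absent = safe_count(ser, 3)
print(present, absent)  # 5 0
```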

EDIT2:

If you need something really fast, use numpy.in1d:

import pandas as pd
import numpy as np

a = pd.Series([1, 1, 1, 1, 2, 2])

# for testing, repeat the series so that len(a) == 6000
a = pd.concat([a] * 1000).reset_index(drop=True)

print(np.in1d(a, 1).sum())
4000
print((a == 1).sum())
4000
print(np.sum(a == 1))
4000
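
A note for newer environments: as far as I know, `np.in1d` is deprecated as of NumPy 2.0 in favour of `np.isin`. The sketch below shows the same count written with `np.isin`; treat it as an equivalent under that assumption, not as something that was timed in this answer.

```python
import numpy as np
import pandas as pd

# Same test data as above: six values repeated 1000 times.
a = pd.concat([pd.Series([1, 1, 1, 1, 2, 2])] * 1000).reset_index(drop=True)

# np.isin returns a boolean array marking the elements equal to 1.
mask = np.isin(a.to_numpy(), 1)

# Counting the True entries gives the number of occurrences.
count_isin = int(mask.sum())
count_mask = int((a == 1).sum())
print(count_isin, count_mask)  # 4000 4000
```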

Timings:

len(a)=6

In [131]: %timeit np.in1d(a,1).sum() 
The slowest run took 9.17 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 29.9 µs per loop 

In [132]: %timeit np.sum(a == 1) 
10000 loops, best of 3: 196 µs per loop 

In [133]: %timeit (a == 1).sum() 
1000 loops, best of 3: 180 µs per loop 

len(a)=6000

In [135]: %timeit np.in1d(a,1).sum() 
The slowest run took 7.29 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 48.5 µs per loop 

In [136]: %timeit np.sum(a == 1) 
The slowest run took 5.23 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 273 µs per loop 

In [137]: %timeit (a == 1).sum() 
1000 loops, best of 3: 271 µs per loop 
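
The numbers above come from IPython's %timeit on one machine, so they will differ elsewhere. If you want to check them outside IPython, a rough sketch with the standard `timeit` module could look like this. It uses `np.isin` in place of `np.in1d` (an assumption on my part, for compatibility with newer NumPy), and the loop counts are arbitrary.

```python
import timeit

import numpy as np
import pandas as pd

# Same 6000-element test series as in the answer.
a = pd.concat([pd.Series([1, 1, 1, 1, 2, 2])] * 1000).reset_index(drop=True)

# The three ways of counting that are compared above.
candidates = {
    'np.isin(a, 1).sum()': lambda: np.isin(a, 1).sum(),
    'np.sum(a == 1)': lambda: np.sum(a == 1),
    '(a == 1).sum()': lambda: (a == 1).sum(),
}

loops = 100  # arbitrary; raise it for more stable numbers
timings = {}
for label, func in candidates.items():
    # Best of 3 repeats, converted to seconds per single call.
    best = min(timeit.repeat(func, number=loops, repeat=3)) / loops
    timings[label] = best
    print('{}: {:.1f} µs per loop'.format(label, best * 1e6))
```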

Sorry, there was a mistake in the question. I have edited it; please take a look now. – Randhawa


If you need to count a single item, 'np.in1d' is faster than the accepted solution. See EDIT2 and the timings. Thanks. – jezrael

2

If you keep the result returned by value_counts, you can query multiple values:

import pandas as pd 

a = pd.Series([1, 1, 1, 1, 2, 2]) 
counts = a.value_counts() 
>>> counts[1], counts[2] 
(4, 2) 

However, to count just a single item, it would be faster to use:

import numpy as np 
np.sum(a == 1) 
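
Coming back to the stored value_counts result: if you want several counts at once, including values that may not occur at all, one option is `reindex` with `fill_value=0`. This is only a sketch; the list of wanted values is made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 1, 1, 1, 2, 2])
counts = a.value_counts()

# Hypothetical list of values to look up; 3 does not occur in a.
wanted = [1, 2, 3]

# reindex keeps the requested order and fills absent values with 0.
selected = counts.reindex(wanted, fill_value=0)
print(selected.tolist())  # [4, 2, 0]
```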