2015-12-02 112 views
2

Pandas: count over multiple columns

I have a DataFrame that looks like this:

Measure1 Measure2 Measure3 ... 
0  1   3 
1  3   2 
3  0   

I want to count the occurrences of each value across the columns, producing:

Measure Count Percentage 
0  2  0.25 
1  2  0.25 
2  1  0.125 
3  3  0.375 

With

outcome_measure_count = cdss_data.groupby(key_columns=['Measure1'],operations={'count': agg.COUNT()}).sort('count', ascending=True) 

I only get the counts for the first column (this actually uses the graphlab package, but I would prefer pandas).

Can anyone help me?

Answers

0

You can generate the counts by flattening the df with `ravel` and then calling `value_counts`; from the result you can construct the final df:

In [230]: 
import io 
import pandas as pd 

t="""Measure1 Measure2 Measure3 
0  1   3 
1  3   2 
3  0  0""" 

df = pd.read_csv(io.StringIO(t), sep=r'\s+') 
df 

Out[230]: 
    Measure1 Measure2 Measure3 
0   0   1   3 
1   1   3   2 
2   3   0   0 

In [240]:  
count = pd.Series(df.squeeze().values.ravel()).value_counts() 
pd.DataFrame({'Measure': count.index, 'Count':count.values, 'Percentage':(count/count.sum()).values}) 

Out[240]: 
    Count Measure Percentage 
0  3  3 0.333333 
1  3  0 0.333333 
2  2  1 0.222222 
3  1  2 0.111111 

I inserted a 0 just to make the df shape correct, but you should get the point.
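The question's table has an empty cell rather than a 0, so here is a minimal sketch of the same flatten-and-count idea with that cell as `NaN`. It assumes a reasonably recent pandas (`to_numpy` is used in place of `.values`) and relies on `value_counts` dropping `NaN` by default; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question, with the empty cell represented as NaN
df = pd.DataFrame({
    "Measure1": [0, 1, 3],
    "Measure2": [1, 3, 0],
    "Measure3": [3, 2, np.nan],
})

# Flatten all cells into one Series; the NaN makes the values floats
flat = pd.Series(df.to_numpy().ravel())

# value_counts() ignores NaN by default (dropna=True)
counts = flat.value_counts()

result = pd.DataFrame({
    "Measure": counts.index,
    "Count": counts.to_numpy(),
    "Percentage": (counts / counts.sum()).to_numpy(),
})
print(result)
```

With the missing cell excluded, the percentages are taken over the 8 remaining values, which matches the table asked for in the question.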

+0

And what if this is part of a larger df, so that I need to specify the columns? When using `count = pd.Series(cdss_data['measure1','measure2'].squeeze().values.ravel()).value_counts()` I get an error (cdss_data is my df). – dsent

+0

You need double subscripting: `count = pd.Series(cdss_data[['measure1','measure2']].squeeze().values.ravel()).value_counts()` – EdChum
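A small sketch of why the double brackets matter, assuming a larger frame with an extra column (`other` is a hypothetical name used only for illustration). With single brackets pandas looks for one column whose label is the tuple `('measure1', 'measure2')`, which raises a `KeyError`; passing a list selects both columns.

```python
import pandas as pd

cdss_data = pd.DataFrame({
    "measure1": [0, 1, 3],
    "measure2": [1, 3, 0],
    "other": ["a", "b", "c"],  # hypothetical extra column that should not be counted
})

# Single brackets: the two names are treated as one tuple key, so this fails
try:
    cdss_data["measure1", "measure2"]
except KeyError as exc:
    print("KeyError:", exc)

# A list of column names selects a sub-DataFrame, which can then be flattened
cols = ["measure1", "measure2"]
count = pd.Series(cdss_data[cols].to_numpy().ravel()).value_counts()
print(count)
```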

+0

Great! Is there a way to force the order of the rows and the columns? – dsent
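If a fixed ordering is wanted, one possible approach (a sketch, not necessarily what the answerer had in mind) is to sort the counts by their index so the rows are ordered by measure value, and to select the columns with an explicit list so their order does not depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "Measure1": [0, 1, 3],
    "Measure2": [1, 3, 0],
    "Measure3": [3, 2, 0],
})

# sort_index() orders the rows by measure value instead of by frequency
count = pd.Series(df.to_numpy().ravel()).value_counts().sort_index()

out = pd.DataFrame({
    "Measure": count.index,
    "Count": count.to_numpy(),
    "Percentage": (count / count.sum()).to_numpy(),
})

# Selecting with a list pins the column order explicitly
out = out[["Measure", "Count", "Percentage"]]
print(out)
```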

0
In [67]: import numpy as np
    ...: from pandas import DataFrame, Series

In [68]: df=DataFrame({'m1':[0,1,3], 'm2':[1,3,0], 'm3':[3,2, np.nan]}) 

In [69]: df 
Out[69]: 
    m1 m2 m3 
0 0 1 3.0 
1 1 3 2.0 
2 3 0 NaN 

In [70]: df=df.apply(Series.value_counts).sum(1).to_frame(name='Count') 

In [71]: df 
Out[71]: 
    Count 
0.0 2.0 
1.0 2.0 
2.0 1.0 
3.0 3.0 

In [72]: df.index.name='Measure' 

In [73]: df 
Out[73]: 
     Count 
Measure 
0.0  2.0 
1.0  2.0 
2.0  1.0 
3.0  3.0 

In [74]: df['Percentage']=df.Count.div(df.Count.sum()) 

In [75]: df 
Out[75]: 
     Count Percentage 
Measure 
0.0  2.0  0.250 
1.0  2.0  0.250 
2.0  1.0  0.125 
3.0  3.0  0.375
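The counts above come out as floats because the per-column `value_counts` results are aligned on the union of all values, and a value missing from a column becomes `NaN` before the row-wise sum. The following is a hedged sketch of the same steps as one script, with the counts cast back to integers and the index sorted explicitly; the names `per_column` and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"m1": [0, 1, 3], "m2": [1, 3, 0], "m3": [3, 2, np.nan]})

# One value_counts per column; rows are the distinct values, NaN where a value is absent
per_column = df.apply(pd.Series.value_counts)

# Sum across columns (NaN is skipped), sort by measure value, and restore integer counts
counts = per_column.sum(axis=1).sort_index().astype(int)

summary = counts.rename("Count").rename_axis("Measure").to_frame()
summary["Percentage"] = summary["Count"] / summary["Count"].sum()
print(summary)
```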