2015-12-22 90 views
4

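I'm just getting started with pandas and would like to know how to count frequencies in a pandas DataFrame.

My data holds the documents for each company and year, and I want the number of unique documents per company per year. df: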

year document_id company 
0 1999 3  Orange 
1 1999 5  Orange 
2 1999 3  Orange 
3 2001 41 Banana 
4 2001 21 Strawberry 
5 2001 18 Strawberry 
6 2002 44 Orange 

In the end, I would like to have something like this:

year document_id company nbDocument 
0 1999 [3,5]  Orange  2 
1 2001 [21]  Banana  1 
2 2001 [21,18] Strawberry 2 
3 2002 [44]  Orange  1 

I tried building a new DataFrame:

count2 = apyData.groupby(['year','company']).agg({'document_id': pd.Series.value_counts}) 

But with this groupby operation I can't get that structure, nor count the unique values (for 1999, for example). Is there a way to do this?

Thanks

+0

Shouldn't document_id for 'Banana' be [41]?

Answers

1

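You can create a new DataFrame and add the unique document_id values with a list comprehension, as follows:

result = pd.DataFrame()
result['document_id'] = df.groupby(['company', 'year']).apply(lambda x: [d for d in x['document_id'].drop_duplicates()])

Now that you have the list of unique document_id values, you just need to get the length of each list: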

result['nbDocument'] = result.document_id.apply(lambda x: len(x)) 

This gives:

result.reset_index().sort_values(['company', 'year']) 

     company year document_id nbDocument 
0  Banana 2001  [41]   1 
1  Orange 1999  [3, 5]   2 
2  Orange 2002  [44]   1 
3 Strawberry 2001 [21, 18]   2 
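
A self-contained sketch of the same idea, for anyone who wants to paste and run it. It assumes the question's sample rows rebuilt by hand, and it swaps the list comprehension for drop_duplicates followed by apply(list); the names df, unique_docs and result are illustrative and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question (assumed to match the asker's frame).
df = pd.DataFrame({
    "year": [1999, 1999, 1999, 2001, 2001, 2001, 2002],
    "document_id": [3, 5, 3, 41, 21, 18, 44],
    "company": ["Orange", "Orange", "Orange", "Banana",
                "Strawberry", "Strawberry", "Orange"],
})

# Drop repeated (year, company, document_id) rows, keeping first occurrences.
unique_docs = df.drop_duplicates(subset=["year", "company", "document_id"])

# Collect the remaining ids of each group into a Python list.
result = (
    unique_docs.groupby(["year", "company"])["document_id"]
    .apply(list)
    .reset_index()
)

# The number of unique documents is the length of each list.
result["nbDocument"] = result["document_id"].apply(len)

print(result)
```

Selecting the document_id column before calling apply keeps the grouping columns out of the function, so only the ids end up in each list.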
0

You can use a custom aggregation with agg, then convert the document_id column to lists:

print(apyData)

    afx year document_id  company 
0 0 1999   3  Orange 
1 1 1999   5  Orange 
2 2 1999   3  Orange 
3 3 2001   41  Banana 
4 4 2001   21 Strawberry 
5 5 2001   18 Strawberry 
6 6 2002   44  Orange 

f = {'nbDocument' : lambda x: len(x.unique()), 'document_id' : lambda x: tuple(x)} 
count2 = apyData.groupby(['year','company']).document_id.agg(f).reset_index() 
print(count2)

    year  company nbDocument document_id 
0 1999  Orange   2 (3, 5, 3) 
1 2001  Banana   1  (41,) 
2 2001 Strawberry   2 (21, 18) 
3 2002  Orange   1  (44,) 

# convert the tuples to lists
count2['document_id'] = count2['document_id'].apply(lambda x: list(x))
# reorder the columns
count2 = count2[['year','document_id','company','nbDocument']]
print(count2)

    year document_id  company nbDocument 
0 1999 [3, 5, 3]  Orange   2 
1 2001  [41]  Banana   1 
2 2001 [21, 18] Strawberry   2 
3 2002  [44]  Orange   1 

Edit:

I can't use 'document_id' : lambda x: list(x) in agg, because it raises an error:

ValueError: Function does not reduce

So I used tuple and converted the result to list afterwards.
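
On more recent pandas releases the dict-of-names form passed to agg on a single column is, as far as I know, no longer accepted (it raises SpecificationError), while keyword-style named aggregation (pandas 0.25 or later) covers the same need. A minimal sketch under that assumption, with the question's sample data rebuilt by hand. It also assumes the installed version lets an aggregation function return a list, which was not the case for the version used in this answer. Unlike the tuples above, it keeps only the unique ids.

```python
import pandas as pd

# Sample data copied from the question.
apyData = pd.DataFrame({
    "year": [1999, 1999, 1999, 2001, 2001, 2001, 2002],
    "document_id": [3, 5, 3, 41, 21, 18, 44],
    "company": ["Orange", "Orange", "Orange", "Banana",
                "Strawberry", "Strawberry", "Orange"],
})

count2 = (
    apyData.groupby(["year", "company"])["document_id"]
    .agg(
        # number of distinct ids in the group
        nbDocument="nunique",
        # distinct ids in first-seen order, as a plain list
        document_id=lambda s: list(s.unique()),
    )
    .reset_index()
)

# Same column order as the output requested in the question.
count2 = count2[["year", "document_id", "company", "nbDocument"]]
print(count2)
```

If the installed pandas still rejects a list here, returning a tuple and converting afterwards, as in the answer, remains an option.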

EDIT1:

I checked the timings:

def je(apyData): 
    f = {'nbDocument' : lambda x: len(x.unique()), 'document_id' : lambda x: tuple(x)} 
    count2 = apyData.groupby(['year','company']).document_id.agg(f).reset_index() 
    count2['document_id'] = count2['document_id'].apply(lambda x: list(x)) 
    return count2 

def mm(df):
    out = pd.DataFrame()
    grouped = df.groupby(['year', 'company'])
    # the list of unique ids goes in document_id, its length in nbDocument
    out['document_id'] = grouped.apply(lambda x: list(x['document_id'].drop_duplicates()))
    out['nbDocument'] = out['document_id'].apply(lambda x: len(x))
    return (out.reset_index().sort_values(['year', 'company']))

def st(df): 
    result = pd.DataFrame() 
    result['document_id'] = df.groupby(['company', 'year']).apply(lambda x: [d for d in x['document_id'].drop_duplicates()])  
    result['nbDocument'] = result.document_id.apply(lambda x: len(x)) 
    return result.reset_index().sort_values(['company', 'year']) 

print(mm(apyData))
print(st(apyData))
print(je(apyData))

Results:

In [48]: %timeit je(apyData) 
100 loops, best of 3: 3.08 ms per loop 

In [49]: %timeit mm(apyData) 
100 loops, best of 3: 5.73 ms per loop 

In [50]: %timeit st(apyData) 
100 loops, best of 3: 5.8 ms per loop 
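
Outside IPython, where %timeit is not available, a similar measurement can be made with the standard-library timeit module. A rough sketch, assuming the question's sample data rebuilt by hand; unique_docs_per_group is a hypothetical stand-in for whichever of je, mm or st is being measured. With only seven rows the figures probably reflect fixed overhead more than anything else, so a larger frame would be needed for a meaningful ranking.

```python
import timeit

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "year": [1999, 1999, 1999, 2001, 2001, 2001, 2002],
    "document_id": [3, 5, 3, 41, 21, 18, 44],
    "company": ["Orange", "Orange", "Orange", "Banana",
                "Strawberry", "Strawberry", "Orange"],
})


def unique_docs_per_group(frame):
    """Hypothetical candidate: unique ids and their count per (year, company)."""
    deduped = frame.drop_duplicates(subset=["year", "company", "document_id"])
    out = (
        deduped.groupby(["year", "company"])["document_id"]
        .apply(list)
        .reset_index()
    )
    out["nbDocument"] = out["document_id"].apply(len)
    return out


# Call the candidate `runs` times and report the average time per call.
runs = 100
total = timeit.timeit(lambda: unique_docs_per_group(df), number=runs)
print("%.3f ms per call" % (total / runs * 1000))
```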
0

This produces the desired output:

out = pd.DataFrame()
grouped = df.groupby(['year', 'company'])
# the list of unique ids goes in document_id, its length in nbDocument
out['document_id'] = grouped.apply(lambda x: list(x['document_id'].drop_duplicates()))
out['nbDocument'] = out['document_id'].apply(lambda x: len(x))
print(out.reset_index().sort_values(['year', 'company']))

    year  company document_id nbDocument
0 1999  Orange  [3, 5]   2
1 2001  Banana  [41]   1
2 2001 Strawberry [21, 18]   2
3 2002  Orange  [44]   1