2015-09-06 182 views
2

Iterating through pandas groupby groups

I have a pandas DataFrame, school_df, that looks like this:

school_id date_posted date_completed 
0 A   2014-01-01 2014-01-01 
1 A   2014-01-01 2014-01-08 
2 A   2014-04-29 2014-05-01 
3 B   2014-01-01 2014-01-01 
4 B   2014-01-20 2014-02-23 

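Each row represents one project by a school. I want to add two columns: for each unique school_id, a count of how many projects were posted before that date, and a count of how many projects were completed before that date.

The code below works, but I have about 300,000 unique schools, so it takes a very long time to run. Is there a faster way to get what I'm looking for? Thanks for your help!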

import pandas as pd 
groups = school_df.groupby("school_id") 
blank_df = pd.DataFrame() 
for g, df in groups: 
    df['school_previous_projects'] = df.date_posted.map(lambda x: len(df[df.date_posted < x])) 
    df['school_previous_completed'] = df.date_posted.map(lambda x: len(df[df.date_completed < x])) 
    blank_df = pd.concat([blank_df, df]) 
+0

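@BobHaffner has a good answer. Thinking outside the box, you could group by school and set the index on the date column once. Then you can use a rolling count, since it will be sorted by date. That is much faster than using the apply method and checking the len for each row. Check out cumcount: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html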

+0

I agree with @BrianPendleton. My approach will probably be faster than yours, but there may well be a better way.

Answers

0

Here is a version using cumcount (I simplified the dates, but it should still work):

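import pandas as pd 

# Simplified sample data: one row per project, already in date order within each school 
df = pd.DataFrame({'date_completed': pd.date_range('2014-01-01', '2014-01-05'), 
        'date_posted': pd.date_range('2014-01-01', '2014-01-05'), 
        'school_id': ['A', 'A', 'A', 'B', 'B']}) 

# cumcount numbers the rows of each group 0, 1, 2, ... in their current order 
posted = df.set_index('date_posted').groupby('school_id').cumcount() 
comp = df.set_index('date_completed').groupby('school_id').cumcount() 

# Assign by position, because posted and comp are indexed by date, not by row label 
df['posted'] = posted.values 
df['comp'] = comp.values 

print(df) 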

Result:

date_completed date_posted school_id posted comp 
0  2014-01-01 2014-01-01   A  0  0 
1  2014-01-02 2014-01-02   A  1  1 
2  2014-01-03 2014-01-03   A  2  2 
3  2014-01-04 2014-01-04   B  0  0 
4  2014-01-05 2014-01-05   B  1  1 
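
One caveat with the version above: cumcount numbers rows by their position within each group, so it only equals "projects before this date" when the frame is sorted by date and no two projects of a school share a date. The question's sample data has two school A projects posted on 2014-01-01. A minimal sketch that handles ties, assuming "before" means a strictly earlier date in the same column and that there are no missing dates, uses a grouped rank with method="min" (the output column names are the ones from the question):

```python
import pandas as pd

# Sample frame copied from the question
school_df = pd.DataFrame({
    "school_id": ["A", "A", "A", "B", "B"],
    "date_posted": pd.to_datetime(
        ["2014-01-01", "2014-01-01", "2014-04-29", "2014-01-01", "2014-01-20"]),
    "date_completed": pd.to_datetime(
        ["2014-01-01", "2014-01-08", "2014-05-01", "2014-01-01", "2014-02-23"]),
})

grouped = school_df.groupby("school_id")

# rank(method="min") gives tied dates the lowest rank in the tie, so
# rank - 1 is the number of rows in the same school with a strictly earlier date.
school_df["school_previous_projects"] = (
    grouped["date_posted"].rank(method="min").astype(int) - 1)
school_df["school_previous_completed"] = (
    grouped["date_completed"].rank(method="min").astype(int) - 1)

print(school_df)
```

This does not require the frame to be sorted, and it avoids a Python-level loop over schools.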
1

Give this a try. It should be faster than your for loop and two maps. Starting with your frame

school_id date_posted date_completed 
0 A   2014-01-01 2014-01-01 
1 A   2014-01-01 2014-01-08 
2 A   2014-04-29 2014-05-01 
3 B   2014-01-01 2014-01-01 
4 B   2014-01-20 2014-02-23 

Then a function. getProjectCounts() uses boolean indexing and a simple count()

def getProjectCounts(row, df): 
    # Rows from the same school that were posted strictly before this row 
    mask = (df["school_id"] == row["school_id"]) & (df["date_posted"] < row["date_posted"]) 
    dp_count = df[mask]["date_posted"].count() 
    # Rows from the same school that were completed strictly before this row was completed 
    mask = (df["school_id"] == row["school_id"]) & (df["date_completed"] < row["date_completed"]) 
    dc_count = df[mask]["date_completed"].count() 
    return pd.Series([dp_count, dc_count]) 

Then apply() the function row by row

school_df[["school_previous_projects","school_previous_completed"]] = school_df.apply(lambda x : getProjectCounts(x, school_df),axis=1) 


    school_id date_posted date_completed school_previous_projects \ 
0   A 2014-01-01  2014-01-01       0 
1   A 2014-01-01  2014-01-08       0 
2   A 2014-04-29  2014-05-01       2 
3   B 2014-01-01  2014-01-01       0 
4   B 2014-01-20  2014-02-23       1 

    school_previous_completed 
0       0 
1       1 
2       2 
3       0 
4       1
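
The apply() above scans the whole frame once per row, so it may still be slow on a large frame. Note also that it compares date_completed against each row's own date_completed, while the loop in the question compares it against date_posted. A rough sketch that follows the question's comparison, assuming numpy is available, sorts each school's dates once and looks the counts up with np.searchsorted (the helper name add_previous_counts is made up for illustration):

```python
import numpy as np
import pandas as pd


def add_previous_counts(df):
    """Return a copy of df with per-school counts of strictly earlier dates."""
    posted = df["date_posted"].to_numpy()
    completed = df["date_completed"].to_numpy()
    prev_posted = np.zeros(len(df), dtype=np.int64)
    prev_completed = np.zeros(len(df), dtype=np.int64)

    # .indices maps each school_id to the positional indices of its rows
    for positions in df.groupby("school_id").indices.values():
        p = posted[positions]
        # side="left" returns how many sorted values are strictly less than each date
        prev_posted[positions] = np.searchsorted(np.sort(p), p, side="left")
        prev_completed[positions] = np.searchsorted(
            np.sort(completed[positions]), p, side="left")

    out = df.copy()
    out["school_previous_projects"] = prev_posted
    out["school_previous_completed"] = prev_completed
    return out


# Sample frame copied from the question
school_df = pd.DataFrame({
    "school_id": ["A", "A", "A", "B", "B"],
    "date_posted": pd.to_datetime(
        ["2014-01-01", "2014-01-01", "2014-04-29", "2014-01-01", "2014-01-20"]),
    "date_completed": pd.to_datetime(
        ["2014-01-01", "2014-01-08", "2014-05-01", "2014-01-01", "2014-02-23"]),
})

result = add_previous_counts(school_df)
print(result)
```

There is still a Python loop over schools here, but each school costs one sort plus binary searches instead of one full filter per row.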