2016-10-31 165 views

Conditional backfill of a pandas column

I have the following DataFrame:

 DATE  ID  STATUS 
0 2014-01-01 1 INPROGRESS 
1 2013-03-01 1  ENDED 
2 2015-05-01 2 INPROGRESS 
3 2012-05-01 1  STARTED 
4 2011-05-01 2  STARTED 
5 2011-03-01 3  STARTED 
6 2011-04-01 3  ENDED 
7 2011-06-01 3 INPROGRESS 
8 2011-09-01 3  STARTED 

Here is the code to build it:

>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=["DATE", "ID", "STATUS"]) 
>>> df1["DATE"] = ['2014-01-01', '2013-03-01', '2015-05-01', '2012-05-01', '2011-05-01', '2011-03-01', '2011-04-01', '2011-06-01', '2011-09-01'] 
>>> df1["ID"] = [1,1,2,1,2,3,3,3,3] 
>>> df1["STATUS"] = ['INPROGRESS', 'ENDED', 'INPROGRESS', 'STARTED', 'STARTED', 'STARTED','ENDED', 'INPROGRESS', 'STARTED'] 

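For each ID group, the STATUS column represents the state of a task, which can be:

STARTED, INPROGRESS or ENDED

in this exact chronological order (STARTED should not appear after ENDED, and so on).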

Filtering on ID 3 and sorting by date, I get:

df1[df1['ID'] == 3].sort_values('DATE') 

    DATE  ID  STATUS 
5 2011-03-01 3  STARTED 
6 2011-04-01 3  ENDED 
7 2011-06-01 3 INPROGRESS 
8 2011-09-01 3  STARTED 

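Now I need to "fix" the STATUS column so that it follows the order defined above, based on the last status. For ID 3 the last status is STARTED, so every earlier row should be backfilled with STARTED as well: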

 DATE  ID  STATUS 
5 2011-03-01 3  STARTED 
6 2011-04-01 3  STARTED 
7 2011-06-01 3  STARTED 
8 2011-09-01 3  STARTED 

For ID 1:

df1[df1['ID'] == 1].sort_values('DATE') 
    DATE ID  STATUS 
3 2012-05-01 1  STARTED 
1 2013-03-01 1  ENDED 
0 2014-01-01 1 INPROGRESS 

I would end up with the last two statuses as INPROGRESS, preceded by STARTED:

df1[df1['ID'] == 1].sort_values('DATE') 
    DATE ID  STATUS 
3 2012-05-01 1  STARTED 
1 2013-03-01 1 INPROGRESS 
0 2014-01-01 1 INPROGRESS 

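The order for ID 2 is already correct.

Any idea how to do this with pandas? I tried grouping by ID and was thinking of a backfill based on the last status, but I don't know how to stop the backfill at the right point.

Thanks!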

Answer


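A classic approach is to forget that your statuses are labels and treat them instead as strictly increasing numbers, e.g. STARTED 0, INPROGRESS 1 and ENDED 2. With such a column you can check the monotonicity of these numbers within each group, then backfill until you see the monotonicity break.

Prepare your DataFrame:

keymapping = {'STARTED':0, 'INPROGRESS':1, 'ENDED':2} 
df = df1.copy()  # df1 is the DataFrame built in the question
df['STATUS_ID'] = df.STATUS.map(keymapping) 
df.set_index(['ID', 'DATE'], inplace=True) 
df.sort_index(inplace=True) 

Now group by ID and use transform to propagate the last value of each group across the whole index, so that you can assign it to your DataFrame as a new column: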

df['STATUS_LAST'] = df.groupby(level='ID')['STATUS_ID'].transform('last') 

df 
Out[63]: 
        STATUS STATUS_ID STATUS_LAST 
ID DATE           
1 2012-05-01  STARTED   0   1 
    2013-03-01  ENDED   2   1 
    2014-01-01 INPROGRESS   1   1 
2 2011-05-01  STARTED   0   1 
    2015-05-01 INPROGRESS   1   1 
3 2011-03-01  STARTED   0   0 
    2011-04-01  ENDED   2   0 
    2011-06-01 INPROGRESS   1   0 
    2011-09-01  STARTED   0   0 
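
The monotonicity check mentioned at the start of this answer can be sketched as follows. This is only an illustration on the sample data from the question; the names `ordered`, `status_id` and `is_ordered` are introduced here and are not part of the original code.

```python
import pandas as pd

# Same sample data as in the question.
df1 = pd.DataFrame({
    "DATE": ['2014-01-01', '2013-03-01', '2015-05-01', '2012-05-01', '2011-05-01',
             '2011-03-01', '2011-04-01', '2011-06-01', '2011-09-01'],
    "ID": [1, 1, 2, 1, 2, 3, 3, 3, 3],
    "STATUS": ['INPROGRESS', 'ENDED', 'INPROGRESS', 'STARTED', 'STARTED',
               'STARTED', 'ENDED', 'INPROGRESS', 'STARTED'],
})

keymapping = {'STARTED': 0, 'INPROGRESS': 1, 'ENDED': 2}

# Sort chronologically within each ID, then encode the labels as numbers.
ordered = df1.sort_values(['ID', 'DATE'])
status_id = ordered['STATUS'].map(keymapping)

# For each ID: True if the encoded statuses never decrease over time.
is_ordered = status_id.groupby(ordered['ID']).apply(lambda s: s.is_monotonic_increasing)
print(is_ordered)
```

On this data only ID 2 comes out as ordered, which matches the question's remark that ID 2 needs no fix.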

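Finally, apply the backfill by comparing STATUS_ID against the last value of its group: each STATUS_ID value is valid if it is lower than or equal to STATUS_LAST, and is replaced by STATUS_LAST otherwise: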

df.STATUS_ID = df.STATUS_ID.where(df.STATUS_ID <= df.STATUS_LAST, df.STATUS_LAST) 
df.STATUS_ID 
Out[65]: 
ID DATE  
1 2012-05-01 0 
    2013-03-01 1 
    2014-01-01 1 
2 2011-05-01 0 
    2015-05-01 1 
3 2011-03-01 0 
    2011-04-01 0 
    2011-06-01 0 
    2011-09-01 0 
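
To see what the `where` call does in isolation, here is a small sketch using the encoded values of ID 1. The two Series are typed in by hand for illustration, and the `clip` variant is an equivalent formulation added here, not something the answer above uses.

```python
import pandas as pd

# Encoded statuses of ID 1 in date order, and the group's last value repeated.
status_id = pd.Series([0, 2, 1])
status_last = pd.Series([1, 1, 1])

# where() keeps values that satisfy the condition and takes the fallback otherwise.
capped_where = status_id.where(status_id <= status_last, status_last)

# Capping each value at an upper bound expresses the same operation.
capped_clip = status_id.clip(upper=status_last)

print(capped_where.tolist())
print(capped_clip.tolist())
```

Both calls cap the ENDED row (2) at the group's last status, INPROGRESS (1).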

Reverse the mapping to get the labels back, and assign the result to STATUS:

df.STATUS_ID.map({v:k for k,v in keymapping.items()}) 
Out[66]: 
ID DATE  
1 2012-05-01  STARTED 
    2013-03-01 INPROGRESS 
    2014-01-01 INPROGRESS 
2 2011-05-01  STARTED 
    2015-05-01 INPROGRESS 
3 2011-03-01  STARTED 
    2011-04-01  STARTED 
    2011-06-01  STARTED 
    2011-09-01  STARTED 
Name: STATUS_ID, dtype: object
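
Putting the steps together, a minimal end-to-end sketch might look like the following. The function name `cap_status_at_last` is hypothetical, and it works on regular columns rather than a MultiIndex, which is a choice made here for brevity and not what the answer above does.

```python
import pandas as pd

KEYMAPPING = {'STARTED': 0, 'INPROGRESS': 1, 'ENDED': 2}


def cap_status_at_last(frame):
    """Return a copy sorted by ID and DATE, with STATUS capped at each ID's last status."""
    out = frame.sort_values(['ID', 'DATE']).copy()
    status_id = out['STATUS'].map(KEYMAPPING)
    # Broadcast the chronologically last encoded status to every row of its ID.
    status_last = status_id.groupby(out['ID']).transform('last')
    # Keep values that do not exceed the last status; replace the others with it.
    capped = status_id.where(status_id <= status_last, status_last)
    # Map the numbers back to their labels.
    reverse = {v: k for k, v in KEYMAPPING.items()}
    out['STATUS'] = capped.map(reverse)
    return out


# Same sample data as in the question.
df1 = pd.DataFrame({
    "DATE": ['2014-01-01', '2013-03-01', '2015-05-01', '2012-05-01', '2011-05-01',
             '2011-03-01', '2011-04-01', '2011-06-01', '2011-09-01'],
    "ID": [1, 1, 2, 1, 2, 3, 3, 3, 3],
    "STATUS": ['INPROGRESS', 'ENDED', 'INPROGRESS', 'STARTED', 'STARTED',
               'STARTED', 'ENDED', 'INPROGRESS', 'STARTED'],
})

fixed = cap_status_at_last(df1)
print(fixed)
```

Note that this only caps each status at the last one of its group. Earlier rows that are out of order but still below the cap (for example INPROGRESS followed by STARTED in a group that ends with ENDED) are left as they are.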