2012-11-13 152 views

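Fill in a continuous pandas DataFrame from a sparse DataFrame

I have a dictionary named date_dict, keyed by datetime.date objects, whose values are integer counts of observations. I convert it to a sparse Series/DataFrame, which I would like to join to, or convert into, a Series/DataFrame with continuous dates. The nasty list comprehension is my hack to get around the fact that pandas apparently won't automatically convert datetime.date objects to an appropriate DatetimeIndex.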

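import datetime
import pandas as pd

# Promote each datetime.date key to a datetime.datetime at midnight.
df1 = pd.DataFrame(data=list(date_dict.values()),
                   index=[datetime.datetime.combine(i, datetime.time())
                          for i in date_dict.keys()],
                   columns=['Name'])
df1 = df1.sort_index()  # DataFrame.sort() no longer exists in current pandas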

This example has 1258 observations, and the DatetimeIndex runs from 2003-06-24 to 2012-11-07.

df1.head() 
      Name 
Date 
2003-06-24 2 
2003-08-13 1 
2003-08-19 2 
2003-08-22 1 
2003-08-24 5 

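I can create an empty DataFrame with a continuous DatetimeIndex, but this introduces an unneeded column and seems clunky. I feel like I'm missing a more elegant solution involving a join.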

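# pd.DateRange no longer exists in current pandas; pd.bdate_range gives the
# business-day index seen in the output below (weekends are skipped).
df2 = pd.DataFrame(data=None, columns=['Empty'],
                   index=pd.bdate_range(min(date_dict.keys()),
                                        max(date_dict.keys())))
df3 = df1.join(df2, how='right')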
df3.head() 
      Name Empty 
2003-06-24 2 NaN 
2003-06-25 NaN NaN 
2003-06-26 NaN NaN 
2003-06-27 NaN NaN 
2003-06-30 NaN NaN 

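Is there a simpler or more elegant way to fill in a continuous DataFrame from a sparse DataFrame so that (1) the index is continuous, (2) the NaNs are 0, and (3) no leftover empty column remains in the DataFrame?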

  Name 
2003-06-24 2 
2003-06-25 0 
2003-06-26 0 
2003-06-27 0 
2003-06-30 0 

Answer


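You can use reindex on a time series with a date range. Also, it looks like you would be better off using a TimeSeries (in current pandas, simply a Series with a DatetimeIndex) rather than a DataFrame (see the documentation), although reindexing is also the right way to add missing index values to a DataFrame.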

For example, starting with:

import pandas as pd

# pd.datetime is gone from current pandas; date strings work directly.
date_index = pd.DatetimeIndex(['2003-06-24', '2003-08-13', '2003-08-19',
                               '2003-08-22', '2003-08-24'])

ts = pd.Series([2, 1, 2, 1, 5], index=date_index)

gives you a time series like the head of your example DataFrame:

2003-06-24 2 
2003-08-13 1 
2003-08-19 2 
2003-08-22 1 
2003-08-24 5 

Simply doing

ts.reindex(pd.date_range(min(date_index), max(date_index))) 

then gives you a complete index, with NaN for the missing values (you can use fillna if you want to fill the missing values with some other value, see here):

2003-06-24  2 
2003-06-25 NaN 
2003-06-26 NaN 
2003-06-27 NaN 
2003-06-28 NaN 
2003-06-29 NaN 
2003-06-30 NaN 
2003-07-01 NaN 
2003-07-02 NaN 
2003-07-03 NaN 
2003-07-04 NaN 
2003-07-05 NaN 
2003-07-06 NaN 
2003-07-07 NaN 
2003-07-08 NaN 
2003-07-09 NaN 
2003-07-10 NaN 
2003-07-11 NaN 
2003-07-12 NaN 
2003-07-13 NaN 
2003-07-14 NaN 
2003-07-15 NaN 
2003-07-16 NaN 
2003-07-17 NaN 
2003-07-18 NaN 
2003-07-19 NaN 
2003-07-20 NaN 
2003-07-21 NaN 
2003-07-22 NaN 
2003-07-23 NaN 
2003-07-24 NaN 
2003-07-25 NaN 
2003-07-26 NaN 
2003-07-27 NaN 
2003-07-28 NaN 
2003-07-29 NaN 
2003-07-30 NaN 
2003-07-31 NaN 
2003-08-01 NaN 
2003-08-02 NaN 
2003-08-03 NaN 
2003-08-04 NaN 
2003-08-05 NaN 
2003-08-06 NaN 
2003-08-07 NaN 
2003-08-08 NaN 
2003-08-09 NaN 
2003-08-10 NaN 
2003-08-11 NaN 
2003-08-12 NaN 
2003-08-13  1 
2003-08-14 NaN 
2003-08-15 NaN 
2003-08-16 NaN 
2003-08-17 NaN 
2003-08-18 NaN 
2003-08-19  2 
2003-08-20 NaN 
2003-08-21 NaN 
2003-08-22  1 
2003-08-23 NaN 
2003-08-24  5 
Freq: D, Length: 62 
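
To get zeros instead of NaN, as the question asks, one option is to chain fillna(0) after the reindex. The following is a minimal sketch that reuses the five sample values above; the names full_range and filled are illustrative only. Because NaN forces a float dtype, the result is cast back to int at the end.

```python
import pandas as pd

date_index = pd.DatetimeIndex(['2003-06-24', '2003-08-13', '2003-08-19',
                               '2003-08-22', '2003-08-24'])
ts = pd.Series([2, 1, 2, 1, 5], index=date_index)

# Daily range spanning the first to the last observed date.
full_range = pd.date_range(date_index.min(), date_index.max())

# reindex inserts NaN for the missing days; fillna(0) replaces them,
# and astype(int) undoes the float upcast that NaN caused.
filled = ts.reindex(full_range).fillna(0).astype(int)

print(filled.head())
```

The original observations keep their values, and every other day in the range becomes 0.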

Thanks! I used ts.reindex(pd.date_range(min(date_index), max(date_index)), fill_value=0)
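
Applied to a DataFrame like the one in the question, the same reindex with fill_value=0 might look like the sketch below. The date_dict contents are made-up sample data standing in for the real dictionary, and pd.bdate_range is an assumption, chosen because the desired output in the question skips weekends; pd.date_range would cover every calendar day instead.

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the question's date_dict (datetime.date -> count).
date_dict = {
    datetime.date(2003, 6, 24): 2,
    datetime.date(2003, 7, 1): 1,
    datetime.date(2003, 7, 3): 5,
}

# pd.to_datetime turns the datetime.date keys into a DatetimeIndex.
df1 = pd.DataFrame({'Name': list(date_dict.values())},
                   index=pd.to_datetime(list(date_dict.keys()))).sort_index()

# Business days only (assumption); swap in pd.date_range for calendar days.
full_range = pd.bdate_range(df1.index.min(), df1.index.max())

# fill_value=0 fills the new rows directly, so the column stays integer
# and no helper column is needed.
df3 = df1.reindex(full_range, fill_value=0)

print(df3)
```

Note that reindex keeps only the labels in the new index, so with a business-day range any observation dated on a weekend would be dropped.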