2015-11-27 56 views

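Fill datetimeindex gap by NaN

I have two DataFrames indexed by a DatetimeIndex. One of them (df1) is missing some datetimes, while the other (df2) is complete (regular timestamps without any gaps in the series) and is full of NaNs.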

I am trying to match the values from df1 to the index of df2, filling with NaN wherever a timestamp does not exist in df1.

Example:

In [51]: df1 
Out [51]:      value 
      2015-01-01 14:00:00 20 
      2015-01-01 15:00:00 29 
      2015-01-01 16:00:00 41 
      2015-01-01 17:00:00 43 
      2015-01-01 18:00:00 26 
      2015-01-01 19:00:00 20 
      2015-01-01 20:00:00 31 
      2015-01-01 21:00:00 35 
      2015-01-01 22:00:00 39 
      2015-01-01 23:00:00 17 
      2015-03-01 00:00:00 6 
      2015-03-01 01:00:00 37 
      2015-03-01 02:00:00 56 
      2015-03-01 03:00:00 12 
      2015-03-01 04:00:00 41 
      2015-03-01 05:00:00 31 
      ... ... 

      2018-12-25 23:00:00 41 

      <34843 rows × 1 columns> 

In [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max())) 
      df2['value'] = np.nan
      df2 
Out [52]:      value 
      2015-01-01 14:00:00 NaN 
      2015-01-01 15:00:00 NaN 
      2015-01-01 16:00:00 NaN 
      2015-01-01 17:00:00 NaN 
      2015-01-01 18:00:00 NaN 
      2015-01-01 19:00:00 NaN 
      2015-01-01 20:00:00 NaN 
      2015-01-01 21:00:00 NaN 
      2015-01-01 22:00:00 NaN 
      2015-01-01 23:00:00 NaN 
      2015-01-02 00:00:00 NaN 
      2015-01-02 01:00:00 NaN 
      2015-01-02 02:00:00 NaN 
      2015-01-02 03:00:00 NaN 
      2015-01-02 04:00:00 NaN 
      2015-01-02 05:00:00 NaN 
      ...     ... 
      2018-12-25 23:00:00 NaN 

      <34906 rows × 1 columns> 

Using df2.combine_first(df1) returns the same data as df1.reindex(index=df2.index): instead of NaN, the gaps where there should be no data are filled with some values.

In [53]: Result = df2.combine_first(df1) 
      Result 
Out [53]:      value 
      2015-01-01 14:00:00 20 
      2015-01-01 15:00:00 29 
      2015-01-01 16:00:00 41 
      2015-01-01 17:00:00 43 
      2015-01-01 18:00:00 26 
      2015-01-01 19:00:00 20 
      2015-01-01 20:00:00 31 
      2015-01-01 21:00:00 35 
      2015-01-01 22:00:00 39 
      2015-01-01 23:00:00 17 
      2015-01-02 00:00:00 35 
      2015-01-02 01:00:00 53 
      2015-01-02 02:00:00 28 
      2015-01-02 03:00:00 48 
      2015-01-02 04:00:00 42 
      2015-01-02 05:00:00 51 
      ...     ... 
      2018-12-25 23:00:00 41 

      <34906 rows × 1 columns> 

This is what I was hoping to get:

Out [53]:      value 
      2015-01-01 14:00:00 20 
      2015-01-01 15:00:00 29 
      2015-01-01 16:00:00 41 
      2015-01-01 17:00:00 43 
      2015-01-01 18:00:00 26 
      2015-01-01 19:00:00 20 
      2015-01-01 20:00:00 31 
      2015-01-01 21:00:00 35 
      2015-01-01 22:00:00 39 
      2015-01-01 23:00:00 17 
      2015-01-02 00:00:00 NaN 
      2015-01-02 01:00:00 NaN 
      2015-01-02 02:00:00 NaN 
      2015-01-02 03:00:00 NaN 
      2015-01-02 04:00:00 NaN 
      2015-01-02 05:00:00 NaN 
      ...     ... 
      2018-12-25 23:00:00 41 

      <34906 rows × 1 columns> 

Could someone shed some light on why this is happening, and on how to control the way these values are filled?
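
For reference, here is a minimal sketch of the behaviour being asked for, using made-up toy data rather than the real series. The name `full_index` and the sample values are assumptions for illustration. On recent pandas versions, both `reindex` without a `method` argument and `combine_first` on an all-NaN frame would be expected to leave NaN in the gap; this has not been checked against pandas 0.14.1.

```python
import numpy as np
import pandas as pd

# Toy hourly data with a two-hour hole (00:00 and 01:00 on 2015-01-02 are missing).
idx = pd.date_range("2015-01-01 21:00", periods=3, freq="60min").append(
    pd.date_range("2015-01-02 02:00", periods=2, freq="60min")
)
df1 = pd.DataFrame({"value": [35, 39, 17, 28, 48]}, index=idx)

# Complete hourly index spanning the same period (hypothetical name).
full_index = pd.date_range(df1.index.min(), df1.index.max(), freq="60min")

# Without a `method` argument, reindex inserts NaN for every missing timestamp.
reindexed = df1.reindex(full_index)

# combine_first on an all-NaN frame takes values from df1 only where df1 has them.
df2 = pd.DataFrame({"value": np.nan}, index=full_index)
combined = df2.combine_first(df1)

print(reindexed)
print(combined)
```

If the gaps come back filled on a real dataset, comparing against a small case like this one can help narrow down whether the data or the pandas version is responsible.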

Answers


IIUC you need to resample df1, because it has an irregular frequency and you need a regular one:

print(df1.index.freq)
None

print(Result.index.freq)
<60 * Minutes>
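
The frequency check above can be reproduced on a small made-up frame. This is only a sketch: the timestamps and values are invented, and `regular` is a hypothetical name. An index with gaps carries no fixed frequency, while the index produced by `asfreq` does.

```python
import pandas as pd

# Four hourly timestamps with a hole between 23:00 and 02:00 (toy data).
idx = pd.DatetimeIndex(
    ["2015-01-01 22:00", "2015-01-01 23:00", "2015-01-02 02:00", "2015-01-02 03:00"]
)
df1 = pd.DataFrame({"value": [39, 17, 28, 48]}, index=idx)

# No frequency is attached to the irregular index, and none can be inferred.
print(df1.index.freq)
print(pd.infer_freq(df1.index))

# Conforming to a regular 60-minute grid attaches a frequency to the new index.
regular = df1.asfreq("60min")
print(regular.index.freq)
```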

EDIT1
You can use the function asfreq instead of resample: see the doc and "resample vs asfreq".
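
As a rough illustration of the difference (toy data, not the file from the question): `asfreq` only conforms the index to the new frequency, while `resample` groups rows into bins and aggregates them. The sketch assumes the pandas 0.18+ API, where `resample` needs an explicit aggregation such as `.mean()`. Neither call fills the gaps unless a fill method is requested; the `method="ffill"` line shows one way the fill behaviour can be chosen.

```python
import pandas as pd

# Toy data: 00:00 and 01:00 on 2015-01-02 are missing.
idx = pd.DatetimeIndex(
    ["2015-01-01 22:00", "2015-01-01 23:00", "2015-01-02 02:00", "2015-01-02 03:00"]
)
df1 = pd.DataFrame({"value": [39, 17, 28, 48]}, index=idx)

# asfreq: reindex onto a regular grid, no aggregation; gaps become NaN.
by_asfreq = df1.asfreq("60min")

# resample: bin by hour and aggregate; empty bins have a NaN mean.
by_resample = df1.resample("60min").mean()

# Requesting a fill method is what puts values into the gaps.
filled = df1.asfreq("60min", method="ffill")

print(by_asfreq)
print(by_resample)
print(filled)
```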

EDIT2
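At first I thought resample did not work, because after resampling, Result looked the same as df1. But I tried print df1.info() and print Result.info() and got different results: 34857 entries versus 34920 entries. So I looked for the rows with NaN values, and that returned 63 rows.

So I think resample works well.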

import pandas as pd 

df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0]) 
print(df1.head())

#      value 
#Date/Time     
#2015-01-01 00:00:00  52 
#2015-01-01 01:00:00  5 
#2015-01-01 02:00:00  12 
#2015-01-01 03:00:00  54 
#2015-01-01 04:00:00  47 
print(df1.info())

#<class 'pandas.core.frame.DataFrame'> 
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00 
#Data columns (total 1 columns): 
#value 34857 non-null int64 
#dtypes: int64(1) 
#memory usage: 544.6 KB 
#None 

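# On pandas 0.17 (used here) resample() returns the hourly mean directly.
# On pandas >= 0.18 it returns a Resampler; use df1.resample('60min').mean() or .asfreq() instead.
Result = df1.resample('60min')
print(Result.head())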

#      value 
#Date/Time     
#2015-01-01 00:00:00  52 
#2015-01-01 01:00:00  5 
#2015-01-01 02:00:00  12 
#2015-01-01 03:00:00  54 
#2015-01-01 04:00:00  47 
print(Result.info())

#<class 'pandas.core.frame.DataFrame'> 
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00 
#Freq: 60T 
#Data columns (total 1 columns): 
#value 34857 non-null float64 
#dtypes: float64(1) 
#memory usage: 545.6 KB 
#None 

# find the rows that contain NaN
resultnan = Result[Result.isnull().any(axis=1)]
# temporarily display up to 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
    print(resultnan)

#      value 
#Date/Time     
#2015-01-13 19:00:00 NaN 
#2015-01-13 20:00:00 NaN 
#2015-01-13 21:00:00 NaN 
#2015-01-13 22:00:00 NaN 
#2015-01-13 23:00:00 NaN 
#2015-01-14 00:00:00 NaN 
#2015-01-14 01:00:00 NaN 
#2015-01-14 02:00:00 NaN 
#2015-01-14 03:00:00 NaN 
#2015-01-14 04:00:00 NaN 
#2015-01-14 05:00:00 NaN 
#2015-01-14 06:00:00 NaN 
#2015-01-14 07:00:00 NaN 
#2015-01-14 08:00:00 NaN 
#2015-01-14 09:00:00 NaN 
#2015-02-01 00:00:00 NaN 
#2015-02-01 01:00:00 NaN 
#2015-02-01 02:00:00 NaN 
#2015-02-01 03:00:00 NaN 
#2015-02-01 04:00:00 NaN 
#2015-02-01 05:00:00 NaN 
#2015-02-01 06:00:00 NaN 
#2015-02-01 07:00:00 NaN 
#2015-02-01 08:00:00 NaN 
#2015-02-01 09:00:00 NaN 
#2015-02-01 10:00:00 NaN 
#2015-02-01 11:00:00 NaN 
#2015-02-01 12:00:00 NaN 
#2015-02-01 13:00:00 NaN 
#2015-02-01 14:00:00 NaN 
#2015-02-01 15:00:00 NaN 
#2015-02-01 16:00:00 NaN 
#2015-02-01 17:00:00 NaN 
#2015-02-01 18:00:00 NaN 
#2015-02-01 19:00:00 NaN 
#2015-02-01 20:00:00 NaN 
#2015-02-01 21:00:00 NaN 
#2015-02-01 22:00:00 NaN 
#2015-02-01 23:00:00 NaN 
#2015-11-01 00:00:00 NaN 
#2015-11-01 01:00:00 NaN 
#2015-11-01 02:00:00 NaN 
#2015-11-01 03:00:00 NaN 
#2015-11-01 04:00:00 NaN 
#2015-11-01 05:00:00 NaN 
#2015-11-01 06:00:00 NaN 
#2015-11-01 07:00:00 NaN 
#2015-11-01 08:00:00 NaN 
#2015-11-01 09:00:00 NaN 
#2015-11-01 10:00:00 NaN 
#2015-11-01 11:00:00 NaN 
#2015-11-01 12:00:00 NaN 
#2015-11-01 13:00:00 NaN 
#2015-11-01 14:00:00 NaN 
#2015-11-01 15:00:00 NaN 
#2015-11-01 16:00:00 NaN 
#2015-11-01 17:00:00 NaN 
#2015-11-01 18:00:00 NaN 
#2015-11-01 19:00:00 NaN 
#2015-11-01 20:00:00 NaN 
#2015-11-01 21:00:00 NaN 
#2015-11-01 22:00:00 NaN 
#2015-11-01 23:00:00 NaN 
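
The counts above are consistent with each other: 34920 - 34857 = 63, the number of NaN rows listed. A small sketch of the same sanity check on made-up data follows; the names `gap_rows`, `n_added` and `missing` are hypothetical, and the toy values are not from the test file.

```python
import pandas as pd

# Toy hourly data with a two-hour hole.
idx = pd.date_range("2015-01-01 21:00", periods=3, freq="60min").append(
    pd.date_range("2015-01-02 02:00", periods=2, freq="60min")
)
df1 = pd.DataFrame({"value": [35, 39, 17, 28, 48]}, index=idx)

# Regularise the index; the inserted rows should be NaN.
result = df1.asfreq("60min")

# Rows that are NaN after regularising.
gap_rows = result[result["value"].isnull()]

# Number of rows added by regularising; should equal the number of NaN rows.
n_added = len(result) - len(df1)

# Timestamps present in the regular index but absent from the original one.
missing = result.index.difference(df1.index)

print(len(gap_rows), n_added)
print(missing)
```

If the two numbers disagree on real data, some of the inserted timestamps were given values, which is the symptom described in the question.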

You can [accept](http://stackoverflow.com/tour) the answer. Thanks. – jezrael


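Thanks for the suggestion @jezrael. I tried your method, but I still have the same problem using 'asfreq' or 'resample': the gaps that get filled in to make the series regular contain data that shouldn't be there. There are other holes in the index, which may be having some effect. If it helps, I am using pandas version 0.14.1 and Python 2.7.10 – tg359x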


I added my test data. Do you still have the same problem? If so, it may be your version 0.14.1: I use 0.17.1 and it works well. – jezrael