2015-07-20 300 views

Python pandas: detecting the frequency of a time series

Suppose I have loaded a time series from SQL or CSV (not created in Python). The index would be:

DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00', 
       '2015-03-02 02:00:00', '2015-03-02 03:00:00', 
       '2015-03-02 04:00:00', '2015-03-02 05:00:00', 
       '2015-03-02 06:00:00', '2015-03-02 07:00:00', 
       '2015-03-02 08:00:00', '2015-03-02 09:00:00', 
       ... 
       '2015-07-19 14:00:00', '2015-07-19 15:00:00', 
       '2015-07-19 16:00:00', '2015-07-19 17:00:00', 
       '2015-07-19 18:00:00', '2015-07-19 19:00:00', 
       '2015-07-19 20:00:00', '2015-07-19 21:00:00', 
       '2015-07-19 22:00:00', '2015-07-19 23:00:00'], 
       dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None) 

As you can see, `freq` is None. I would like to know how to detect the frequency of this series and set `freq` to it.

If possible, I would like this to work even when the data is not continuous (there are many gaps in the series).

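I tried to find the mode of the differences between every two consecutive timestamps, but I do not know how to convert it into a format that a Series can read.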

If there are gaps, should the frequency be set by the smallest difference between two timestamps? – mdurant

@mdurant Yes, most of the differences between two consecutive timestamps equal the smallest difference. – Jim

Answers

3

Maybe try taking the differences of the time index and using the mode (or the smallest difference) as the frequency.

import pandas as pd 
import numpy as np 

# simulate some data 
# =================================== 
np.random.seed(0) 
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H') 
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False)) 
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index() 
df 

         col 
2015-03-02 01:00:00 2.0261 
2015-03-02 04:00:00 1.3325 
2015-03-02 05:00:00 -0.9867 
2015-03-02 06:00:00 -0.0671 
2015-03-02 08:00:00 -1.1131 
2015-03-02 09:00:00 0.0494 
2015-03-02 10:00:00 -0.8130 
2015-03-02 11:00:00 1.8453 
...      ... 
2015-07-19 13:00:00 -0.4228 
2015-07-19 14:00:00 1.1962 
2015-07-19 15:00:00 1.1430 
2015-07-19 16:00:00 -1.0080 
2015-07-19 18:00:00 0.4009 
2015-07-19 19:00:00 -1.8434 
2015-07-19 20:00:00 0.5049 
2015-07-19 23:00:00 -0.5349 

[2000 rows x 1 columns] 

# processing 
# ================================== 
# the gap distribution 
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts() 

01:00:00 1181 
02:00:00  499 
03:00:00  180 
04:00:00  93 
05:00:00  24 
06:00:00  10 
07:00:00  9 
08:00:00  3 
dtype: int64 

# the mode can be considered as frequency 
res.index[0] # output: Timedelta('0 days 01:00:00') 
# or maybe the smallest difference 
res.index.min() # output: Timedelta('0 days 01:00:00') 




# get full datetime rng 
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0]) 
full_rng 

DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00', 
       '2015-03-02 03:00:00', '2015-03-02 04:00:00', 
       '2015-03-02 05:00:00', '2015-03-02 06:00:00', 
       '2015-03-02 07:00:00', '2015-03-02 08:00:00', 
       '2015-03-02 09:00:00', '2015-03-02 10:00:00', 
       ... 
       '2015-07-19 14:00:00', '2015-07-19 15:00:00', 
       '2015-07-19 16:00:00', '2015-07-19 17:00:00', 
       '2015-07-19 18:00:00', '2015-07-19 19:00:00', 
       '2015-07-19 20:00:00', '2015-07-19 21:00:00', 
       '2015-07-19 22:00:00', '2015-07-19 23:00:00'], 
       dtype='datetime64[ns]', length=3359, freq='H', tz=None) 
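
To actually attach the detected frequency to the data, one option is to reindex onto the regular grid, so that the missing timestamps become NaN rows and the resulting index carries a `freq`. The following is a minimal sketch under assumptions: the series `s`, the deleted positions, and the variable names are made up for illustration, and the index is assumed to be sorted with every timestamp lying on the detected grid.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly index with a few timestamps removed to create gaps.
full = pd.date_range("2015-03-02", periods=200, freq=pd.Timedelta(hours=1))
idx = full.delete([3, 4, 10, 50, 51, 52])
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Most common gap between consecutive timestamps (the mode).
gaps = s.index.to_series().diff().dropna()
step = gaps.mode().iloc[0]

# Build the regular grid and reindex onto it; gaps are filled with NaN.
grid = pd.date_range(s.index[0], s.index[-1], freq=step)
regular = s.reindex(grid)

print(step)
print(regular.index.freq)
print(regular.isna().sum())
```

Timestamps that do not fall on the grid would be dropped by `reindex`, so this only suits data that is regular apart from its gaps.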
2

The smallest time difference can be found with

np.diff(df.index.values).min() 

which is usually expressed in ns. To get a frequency in Hz, assuming ns:

freq = 1e9 / np.diff(df.index.values).min().astype('int64') 
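
A frequency in Hz is not something pandas accepts as a `freq`. If the goal is a value that pandas can read, the smallest difference can be wrapped in a `pd.Timedelta` and turned into an offset with `to_offset`. This is a sketch, not necessarily what the answer above intended; the four-element index is a made-up example.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Hypothetical index with a three-hour gap in the middle.
idx = pd.DatetimeIndex([
    "2015-03-02 00:00", "2015-03-02 01:00",
    "2015-03-02 04:00", "2015-03-02 05:00",
])

# pd.Timedelta reads the unit from the numpy value, so no ns assumption is needed.
smallest = pd.Timedelta(np.diff(idx.values).min())

# Convert to an offset object that date_range and reindexing accept as freq.
offset = to_offset(smallest)

print(smallest)
print(offset)
```

The resulting `offset` can be passed as `freq=` to `pd.date_range`, in the same way as `res.index[0]` in the first answer.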
2

It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq attribute:

dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H') 
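dt_ix = pd.DatetimeIndex(dt_ix.values)  # rebuild from raw values so that freq is None (the private _set_freq no longer exists in current pandas) 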
dt_ix.inferred_freq 
Out[2]: 'H' 

Or the pandas.infer_freq function:

pd.infer_freq(dt_ix) 
Out[3]: 'H' 

If the data is not continuous, pandas.infer_freq will return None. Similar to what has already been proposed, another option is to use the pandas.Series.diff method:

split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H')) 
split_ix.to_series().diff().min() 
Out[4]: Timedelta('0 days 01:00:00')
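
The two approaches can be combined: try `pd.infer_freq` first and fall back to the most common gap when it returns None. The helper below is a hypothetical sketch (the name `detect_offset` and the sample indexes are not from the original answers), and it assumes a sorted, duplicate-free index with at least three timestamps.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

def detect_offset(index):
    """Hypothetical helper: return a pandas offset describing the index step."""
    freq = pd.infer_freq(index)  # None when the spacing is not uniform
    if freq is not None:
        return to_offset(freq)
    # Fall back to the most common gap between consecutive timestamps.
    gaps = index.to_series().diff().dropna()
    return to_offset(gaps.mode().iloc[0])

# Continuous hourly index, and the same index with 20 timestamps removed.
continuous = pd.date_range("2015-03-02", "2015-03-05", freq=pd.Timedelta(hours=1))
broken = continuous.delete(list(range(10, 30)))

print(pd.infer_freq(broken))
print(detect_offset(continuous))
print(detect_offset(broken))
```

Using the mode rather than the minimum in the fallback makes the result less sensitive to a single unusually short gap, at the cost of being wrong when most of the data is missing.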