2014-03-05 80 views
2

Resampling a pandas time series containing elapsed time values

I have time series data in the format shown at the bottom of this post.

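I want to resample the data to 30-minute intervals, but I need the time-in-state values to be split across the correct intervals (the values are expressed in whole seconds).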

Now imagine a row whose time in state is 2342 seconds (more than 30 minutes) and whose start time is 08:22:00.

User Start Date Start Time State Time in State (secs) 
J.Doe 03-02-2014 08:22:00 A  2342 

When resampling, I need the time in state to be split so that it spills over into the following periods, like this:

User Start Date Time Period State Time in State (secs) 
J.Doe 03-02-2014 08:00:00 A  480 
J.Doe 03-02-2014 08:30:00 A  1800 
J.Doe 03-02-2014 09:00:00 A  62 

480 + 1800 + 62 = 2342

I'm completely lost on how to do this in pandas... I'd appreciate any help :-)

Source data format:

User Start Date Start Time State Time in State (secs) 
J.Doe 03-02-2014 07:58:00 A  36 
J.Doe 03-02-2014 07:59:00 A  43 
J.Doe 03-02-2014 08:00:00 A  59 
J.Doe 03-02-2014 08:01:00 A  32 
J.Doe 03-02-2014 08:21:00 A  15 
J.Doe 03-02-2014 08:22:00 B  3 
J.Doe 03-02-2014 08:22:00 A  2342 
J.Doe 03-02-2014 09:01:00 B  1 
J.Doe 03-02-2014 09:01:00 A  375 
J.Doe 03-02-2014 09:07:00 B  3 
J.Doe 03-02-2014 09:07:00 A  6408 
J.Doe 03-02-2014 10:54:00 B  2 
J.Doe 03-02-2014 10:54:00 A  116 
J.Doe 03-02-2014 10:58:00 B  2 
J.Doe 03-02-2014 10:58:00 A  122 
J.Doe 03-02-2014 10:58:00 A  12 
J.Doe 03-02-2014 11:00:00 B  2 
J.Doe 03-02-2014 11:00:00 A  3417 
J.Doe 03-02-2014 11:57:00 B  3 
J.Doe 03-02-2014 11:57:00 A  120 
J.Doe 03-02-2014 11:59:00 C  165 
J.Doe 03-02-2014 12:02:00 B  3 
J.Doe 03-02-2014 12:02:00 A  7254 
+1

Could you explain why and how the 2342 in your example is partitioned into 480, 1800 and 62? –

+1

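I think the trick is to extract the start and end times and then resample. I believe there is a cookbook example for what is switched on in each period, and this is a (fiddly) extension of those examples... –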

+0

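@Paul H Maybe that wasn't clear enough. Basically, the 2342 seconds start at 8:22, so when deciding which half-hour period of the day they belong to, 8 minutes (480 seconds) fall in the 8:00 to 8:30 period (the state started at 8:22, so 8 minutes of that period were left), 30 minutes (1800 seconds) fall in the 8:30 to 9:00 period, and 62 seconds fall in the 9:00 to 9:30 period. – pmanacas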
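The arithmetic in the comment above can be checked with a few lines of plain Python. This is only a sketch of the splitting rule as described; the function name `split_into_bins` and its parameters are made up for illustration and are not part of the thread.

```python
from datetime import datetime, timedelta

def split_into_bins(start, seconds, bin_minutes=30):
    """Split a duration starting at `start` across fixed-width clock bins."""
    width = timedelta(minutes=bin_minutes)
    end = start + timedelta(seconds=seconds)
    # Floor the start time to the bin boundary at or before it.
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    bin_start = midnight + ((start - midnight) // width) * width
    pieces = []
    while bin_start < end:
        bin_end = bin_start + width
        # Overlap between [start, end] and the current bin.
        overlap = min(end, bin_end) - max(start, bin_start)
        pieces.append((bin_start, int(overlap.total_seconds())))
        bin_start = bin_end
    return pieces

pieces = split_into_bins(datetime(2014, 3, 2, 8, 22), 2342)
for bin_start, secs in pieces:
    print(bin_start.time(), secs)
```

For the 08:22:00 row this gives one piece per half-hour period, and the pieces add back up to 2342.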

Answer

0

I would first create Start and End columns (as datetime64 objects):

In [11]: df['Start'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time']) 

In [12]: df['End'] = df['Start'] + df['Time in State (secs)'].apply(pd.offsets.Second) 

In [13]: row = df.iloc[6, :] 

In [14]: row 
Out[14]: 
User         J.Doe 
Start Date      03-02-2014 
Start Time       08:22:00 
State          A 
Time in State (secs)     2342 
Start     2014-03-02 08:22:00 
End      2014-03-02 09:01:02 
Name: 6, dtype: object 
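As a hedged alternative for recent pandas versions, the End column can also be built with `pd.to_timedelta` instead of applying `pd.offsets.Second` element by element. The two-row frame below is sample data copied from the question, and the explicit `format` string assumes the dates are month-day-year, which matches the `2014-03-02` shown in the output above.

```python
import pandas as pd

# Two sample rows taken from the question's source data.
df = pd.DataFrame({
    "Start Date": ["03-02-2014", "03-02-2014"],
    "Start Time": ["08:22:00", "09:01:00"],
    "Time in State (secs)": [2342, 375],
})

# Assumption: dates are month-day-year.
df["Start"] = pd.to_datetime(
    df["Start Date"] + " " + df["Start Time"], format="%m-%d-%Y %H:%M:%S"
)
# Convert the whole column of seconds to timedeltas in one vectorized call.
df["End"] = df["Start"] + pd.to_timedelta(df["Time in State (secs)"], unit="s")

print(df[["Start", "End"]])
```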

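One way to get the split times is to resample between the start and end times, take the union with the start and end points, and use diff: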

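def split_times(row): 
    # two-point series whose index holds the interval's start and end
    y = pd.Series(0, [row['Start'], row['End']]) 
    # union of the 30-minute bin edges with the start/end points
    splits = y.resample('30min').index + y.index # this fills in middle and sorts too 
    # length of each piece: time from one split point to the next
    res = -splits.to_series().diff(-1) 
    # trim the leading piece (bin edge to Start) and the trailing NaT;
    # this assumes Start does not itself lie on a bin edge
    if len(res) > 2: 
        res = res[1:-1] 
    elif len(res) == 2: 
        res = res[1:] 
    return res.astype(int).resample('30min').astype(np.timedelta64) # hack to resample again 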

In [16]: split_times(row) 
Out[16]: 
2014-03-02 08:22:00 00:08:00 
2014-03-02 08:30:00 00:30:00 
2014-03-02 09:00:00 00:01:02 
dtype: timedelta64[ns] 

In [17]: df.apply(split_times, 1) 
Out[17]: 
    2014-03-02 07:30:00 2014-03-02 08:00:00 2014-03-02 08:30:00 2014-03-02 09:00:00 2014-03-02 09:30:00 2014-03-02 10:00:00 2014-03-02 10:30:00 2014-03-02 11:00:00 2014-03-02 11:30:00 2014-03-02 12:00:00 2014-03-02 12:30:00 2014-03-02 13:00:00 2014-03-02 13:30:00 2014-03-02 14:00:00 
0    00:00:36     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
1    00:00:43     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
2     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
3     NaT    00:00:32     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
4     NaT    00:00:15     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
5     NaT    00:00:03     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
6     NaT    00:08:00    00:30:00    00:01:02     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
7     NaT     NaT     NaT    00:00:01     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
8     NaT     NaT     NaT    00:06:15     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
9     NaT     NaT     NaT    00:00:03     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
10     NaT     NaT     NaT    00:23:00    00:30:00    00:30:00    00:23:48     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
11     NaT     NaT     NaT     NaT     NaT     NaT    00:00:02     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
12     NaT     NaT     NaT     NaT     NaT     NaT    00:01:56     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
13     NaT     NaT     NaT     NaT     NaT     NaT    00:00:02     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
14     NaT     NaT     NaT     NaT     NaT     NaT    00:02:00    00:00:02     NaT     NaT     NaT     NaT     NaT     NaT 
15     NaT     NaT     NaT     NaT     NaT     NaT    00:00:12     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
16     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT 
17     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT    00:26:57     NaT     NaT     NaT     NaT     NaT 
18     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT    00:00:03     NaT     NaT     NaT     NaT     NaT 
19     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT    00:02:00     NaT     NaT     NaT     NaT     NaT 
20     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT    00:01:00    00:01:45     NaT     NaT     NaT     NaT 
21     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT    00:00:03     NaT     NaT     NaT     NaT 
22     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT     NaT    00:28:00    00:30:00    00:30:00    00:30:00    00:02:54 
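The same idea can be sketched for recent pandas versions without the resample/union trick, by computing each interval's overlap with every 30-minute bin it touches. This is not the answer's code: `split_row` and the column names `Time Period` and `Seconds` are made up for illustration, and the five sample rows are taken from the question. Intervals that start exactly on a bin boundary (such as the 11:00:00 row) are counted from their first bin here, whereas in the table above rows 2, 16 and 17 appear to lose that first piece.

```python
import pandas as pd

def split_row(start, end, freq="30min"):
    """Seconds of [start, end] that fall in each `freq` bin, indexed by bin start."""
    width = pd.Timedelta(freq)
    # Every bin start from the one containing `start` up to `end`.
    bin_starts = pd.date_range(start.floor(freq), end, freq=freq)
    secs = [
        (min(end, b + width) - max(start, b)).total_seconds()
        for b in bin_starts
    ]
    out = pd.Series(secs, index=bin_starts)
    return out[out > 0]  # drop empty bins (e.g. when `end` sits on a boundary)

df = pd.DataFrame({
    "State": ["B", "A", "B", "A", "A"],
    "Start": pd.to_datetime([
        "2014-03-02 08:22:00", "2014-03-02 08:22:00",
        "2014-03-02 09:01:00", "2014-03-02 09:01:00",
        "2014-03-02 11:00:00",
    ]),
    "Time in State (secs)": [3, 2342, 1, 375, 3417],
})
df["End"] = df["Start"] + pd.to_timedelta(df["Time in State (secs)"], unit="s")

# Long format: one record per (row, bin) piece.
records = []
for state, start, end in zip(df["State"], df["Start"], df["End"]):
    for period, secs in split_row(start, end).items():
        records.append({"State": state, "Time Period": period, "Seconds": secs})
pieces = pd.DataFrame(records)

# Total seconds per 30-minute period and state.
totals = pieces.groupby(["Time Period", "State"])["Seconds"].sum().unstack(fill_value=0)
print(totals)
```

Keeping the pieces in long format first makes it straightforward to group by user, state, or period afterwards, at the cost of a Python-level loop over the rows.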

To replace the NaTs with 0, it looks like you have to do some trickery in 0.13.1 (this may already be fixed in master, otherwise it's a bug):

res2 = df.apply(split_times, 1).astype(int) 
# hack to replace NaTs with 0 
res2.where(res2 != -9223372036854775808, 0).astype(np.timedelta64) 
# to just get the seconds 
seconds = res2.where(res2 != -9223372036854775808, 0)/10 ** 9 
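In more recent pandas versions the integer sentinel should not be needed. A minimal sketch, assuming a wide frame of timedelta columns like the one above (the small `wide` frame here is made-up sample data), is to fill the NaTs with a zero `Timedelta` and read the seconds off with `dt.total_seconds()`:

```python
import pandas as pd

# Made-up stand-in for the wide result: timedelta columns with some NaT gaps.
wide = pd.DataFrame({
    pd.Timestamp("2014-03-02 08:00"): pd.to_timedelta(["00:08:00", None]),
    pd.Timestamp("2014-03-02 08:30"): pd.to_timedelta(["00:30:00", "00:00:01"]),
})

# Replace missing pieces with a zero-length timedelta.
filled = wide.fillna(pd.Timedelta(0))
# Convert each timedelta column to float seconds.
seconds = filled.apply(lambda col: col.dt.total_seconds())

print(seconds)
```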
+0

When I try 'df['End'] = df['Start'] + df['Time in State (secs)'].apply(pd.offsets.Second)' I get an error: 'ValueError: cannot operate on a series with out a rhs of a series/ndarray of type datetime64[ns] or a timedelta' – pmanacas

+0

@pmanacas? –

+0

NumPy 1.7.1, pandas 0.11.0 (because of admin rights restrictions, I'm limited to Portable Python :-( – pmanacas
