Fancy time series grouping of start/end events with a pandas DataFrame

2016-09-29

To update my implementation I must use pandas and take full advantage of its features, and I would appreciate some help. I have a pandas DataFrame of events that looks like this:

        ID               Start                 End
0   243552 2010-12-12 23:00:53 2010-12-12 23:37:14
1   243621 2010-12-12 23:25:58 2010-12-13 02:20:40
2   243580 2010-12-12 23:39:19 2010-12-13 07:22:39
3   243579 2010-12-12 23:42:53 2010-12-13 05:40:14
4   243491 2010-12-12 23:43:53 2010-12-13 07:48:14
... 
... 

The dtypes are int64 for ID and datetime64[ns] for Start and End. Note that the dataframe is sorted on the Start column, but it is not necessarily sorted on the End column.

I want to analyse this data over a time range between two timestamps T1 and T2 input by the user, split into periods whose length equals an input time span, and produce a new dataframe indexed by the timestamps of those periods.
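
For reference, the index of period timestamps is straightforward to build on its own. A minimal sketch, where T1, T2 and the 10-minute span are hypothetical user inputs chosen for illustration:

import pandas as pd

# Hypothetical user inputs (assumptions for illustration only)
T1 = pd.Timestamp('2010-12-12 23:00:00')
T2 = pd.Timestamp('2010-12-13 08:00:00')
span = '10min'

# Right edges of the periods ]T1, T1 + span], ]T1 + span, T1 + 2*span], ...
periods = pd.date_range(T1, T2, freq=span)[1:]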

What I want to do is group the data per period, producing 5 columns: Start_count, End_count, Span_avg, Start_inter_avg and End_inter_avg. Considering, for example, a 10-minute period grouping, I want this:

                     Start_count  End_count  Span_avg  Start_inter_avg  End_inter_avg
Period
2010-12-12 23:10:00            1          0  00:36:21         00:00:00       00:00:00
2010-12-12 23:20:00            0          0  00:00:00         00:00:00       00:00:00
2010-12-12 23:30:00            1          0  02:54:42         00:00:00       00:00:00
2010-12-12 23:40:00            1          1  07:43:20         00:00:00       00:00:00
2010-12-12 23:50:00            2          0  07:00:51         00:01:00       00:00:00
... 
... 

where the dtypes would be int64 for Start_count and End_count, and timedelta64[ns] for Span_avg, Start_inter_avg and End_inter_avg. The columns of the produced dataframe should be:

  • Start_count: the number of timestamps from the Start column of the original dataframe that fall within the time span ]Period - 10 min, Period];
  • End_count: same as Start_count, but for the End column;
  • Span_avg: computed as follows: 1st) take the entries of the dataframe whose Start value is contained in ]Period - 10 min, Period]; 2nd) for each of those entries compute the difference End - Start; 3rd) average those values;
  • Start_inter_avg: computed as follows: 1st) take the entries of the dataframe whose Start value is contained in ]Period - 10 min, Period], sorted (of course, they are already sorted); 2nd) compute the timedelta differences between consecutive timestamps; 3rd) average those differences. (So for timestamps [a, b, c] in some period there are 2 differences, [b-a, c-b], and the final value equals ((b-a) + (c-b))/2; see the sketch after this list.)
  • End_inter_avg: computed the same way as Start_inter_avg, but using the data from the End column. (Note that sorting beforehand is now required.)
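
A side observation that is not part of the question but simplifies Start_inter_avg and End_inter_avg: the sum of consecutive differences telescopes, so their average equals (max - min) / (count - 1). A minimal sketch checking this on three Start values taken from the sample data below:

import pandas as pd

# Three Start timestamps falling in the same period (from the sample data)
ts = pd.Series(pd.to_datetime([
    '2010-12-12 23:42:53',
    '2010-12-12 23:43:53',
    '2010-12-12 23:43:58',
]))

avg_diff = ts.diff().dropna().mean()                # ((b-a) + (c-b)) / 2
telescoped = (ts.max() - ts.min()) / (len(ts) - 1)  # (c - a) / 2

assert avg_diff == telescoped  # both equal 0 days 00:00:32.500000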

For example, when grouping with 30-minute periods, the resulting table should be:

                     Start_count  End_count      Span_avg  Start_inter_avg  End_inter_avg
Period
2010-12-12 23:30:00            2          0  01:45:31.500         00:25:05       00:00:00
2010-12-13 00:00:00            3          1  07:15:00.666         00:02:17       00:00:00
... 
... 

You can experiment with this test.csv file:

ID,Start,End 
243552,2010-12-12 23:00:53,2010-12-12 23:37:14 
243621,2010-12-12 23:25:58,2010-12-13 02:20:40 
243580,2010-12-12 23:39:19,2010-12-13 07:22:39 
243579,2010-12-12 23:42:53,2010-12-13 05:40:14 
243491,2010-12-12 23:43:53,2010-12-13 07:48:14 
243490,2010-12-12 23:43:58,2010-12-13 01:18:40 
243465,2010-12-13 00:07:53,2010-12-13 07:26:14 
243515,2010-12-13 00:35:58,2010-12-13 03:41:40 
243572,2010-12-13 00:46:58,2010-12-13 03:47:40 
243520,2010-12-13 01:15:53,2010-12-13 05:14:14 
243609,2010-12-13 01:29:53,2010-12-13 08:10:14 
243482,2010-12-13 01:44:19,2010-12-13 05:57:39 
243563,2010-12-13 01:49:53,2010-12-13 06:04:14 
243414,2010-12-13 02:06:16,2010-12-13 02:46:48 
243441,2010-12-13 02:15:16,2010-12-13 03:11:48 
243548,2010-12-13 02:33:58,2010-12-13 02:49:40 
243447,2010-12-13 05:01:42,2010-12-13 21:55:21 
243531,2010-12-13 05:53:25,2010-12-13 07:49:59 
243583,2010-12-13 05:53:25,2010-12-13 09:00:59 
243593,2010-12-13 06:06:25,2010-12-13 09:50:59 
243460,2010-12-13 06:14:42,2010-12-13 18:14:44 
243596,2010-12-13 06:15:10,2010-12-13 21:47:25 
243575,2010-12-13 06:22:42,2010-12-13 20:51:21 
243514,2010-12-13 06:24:14,2010-12-13 08:34:07 
243421,2010-12-13 06:31:14,2010-12-13 10:57:07 
243471,2010-12-13 06:35:23,2010-12-13 14:11:13 
243518,2010-12-13 06:36:48,2010-12-13 17:35:39 
243565,2010-12-13 06:37:43,2010-12-13 17:16:22 
243564,2010-12-13 06:48:16,2010-12-13 16:18:15 
243424,2010-12-13 06:48:48,2010-12-13 16:19:39 
243437,2010-12-13 06:58:46,2010-12-13 17:11:30 
243573,2010-12-13 07:00:14,2010-12-13 09:46:07 
243585,2010-12-13 07:01:35,2010-12-13 09:01:38 
243483,2010-12-13 07:02:16,2010-12-13 16:36:15 
243425,2010-12-13 07:04:21,2010-12-13 16:03:50 
243570,2010-12-13 07:07:48,2010-12-13 08:51:04 
243507,2010-12-13 07:10:03,2010-12-13 15:58:48 
243535,2010-12-13 07:10:23,2010-12-13 11:31:13 
243502,2010-12-13 07:13:21,2010-12-13 19:06:50 
243525,2010-12-13 07:13:21,2010-12-13 19:34:50 
243486,2010-12-13 07:13:56,2010-12-13 17:49:38 
243451,2010-12-13 07:15:58,2010-12-13 17:34:03 
243485,2010-12-13 07:17:35,2010-12-13 09:40:38 
243487,2010-12-13 07:19:01,2010-12-13 10:39:35 
243522,2010-12-13 07:19:25,2010-12-13 18:03:02 
243481,2010-12-13 07:19:48,2010-12-13 11:08:04 
243545,2010-12-13 07:20:42,2010-12-13 20:38:44 
243492,2010-12-13 07:23:07,2010-12-13 17:38:42 
243611,2010-12-13 07:23:23,2010-12-13 12:58:13 
243508,2010-12-13 07:25:25,2010-12-13 18:29:02 
243620,2010-12-13 07:25:46,2010-12-13 17:51:30 
243466,2010-12-13 07:27:40,2010-12-13 19:05:58 
243582,2010-12-13 07:29:29,2010-12-13 20:08:10 
243568,2010-12-13 07:31:17,2010-12-13 15:30:37 
243461,2010-12-13 07:32:24,2010-12-13 20:47:52 
243623,2010-12-13 07:33:10,2010-12-13 10:34:20 
243498,2010-12-13 07:33:25,2010-12-13 16:22:02 
243427,2010-12-13 07:33:48,2010-12-13 20:00:39 
243526,2010-12-13 07:34:10,2010-12-13 09:46:20 
243472,2010-12-13 07:36:10,2010-12-13 20:36:25 
243479,2010-12-13 07:36:48,2010-12-13 19:30:39 
243494,2010-12-13 07:39:07,2010-12-13 17:03:42 
243433,2010-12-13 07:39:35,2010-12-13 09:19:38 
243503,2010-12-13 07:40:06,2010-12-13 13:53:08 
243429,2010-12-13 07:40:35,2010-12-13 10:54:38 
243422,2010-12-13 07:43:23,2010-12-13 10:35:10 
243618,2010-12-13 07:46:19,2010-12-13 11:56:40 
243445,2010-12-13 07:48:14,2010-12-13 10:15:07 
243554,2010-12-13 07:49:14,2010-12-13 09:11:57 
243542,2010-12-13 07:49:17,2010-12-13 18:53:37 
243501,2010-12-13 07:50:40,2010-12-13 19:29:58 
243529,2010-12-13 07:51:18,2010-12-13 17:14:15 
243457,2010-12-13 07:53:55,2010-12-13 15:33:27 
243613,2010-12-13 07:53:58,2010-12-13 17:00:03 
243562,2010-12-13 07:54:01,2010-12-13 14:17:09 
243571,2010-12-13 07:54:48,2010-12-13 18:39:39 
243541,2010-12-13 07:58:53,2010-12-13 16:02:23 
243510,2010-12-13 07:59:10,2010-12-13 19:04:51 
243470,2010-12-13 07:59:46,2010-12-13 17:06:30 
243448,2010-12-13 07:59:48,2010-12-13 18:38:39 
243606,2010-12-13 08:03:21,2010-12-13 18:07:50 
243430,2010-12-13 08:04:08,2010-12-13 17:49:41 
243495,2010-12-13 08:04:25,2010-12-13 18:15:02 
243591,2010-12-13 08:07:08,2010-12-13 17:33:54 
243551,2010-12-13 08:07:10,2010-12-13 18:18:25 
243459,2010-12-13 08:10:14,2010-12-13 10:53:07 
243558,2010-12-13 08:11:00,2010-12-13 11:56:01 
243605,2010-12-13 08:13:20,2010-12-13 16:38:14 
243452,2010-12-13 08:15:23,2010-12-13 13:50:13 
243446,2010-12-13 08:17:06,2010-12-13 14:00:08 
243516,2010-12-13 08:17:20,2010-12-13 15:03:14 
243450,2010-12-13 08:18:17,2010-12-13 16:21:37 
243473,2010-12-13 08:19:22,2010-12-13 12:07:49 
243438,2010-12-13 08:20:10,2010-12-13 19:34:25 
243464,2010-12-13 08:21:03,2010-12-13 14:44:48 
243536,2010-12-13 08:21:29,2010-12-13 17:32:15 
243476,2010-12-13 08:21:58,2010-12-13 17:34:03 
243595,2010-12-13 08:24:19,2010-12-13 11:38:40 
243532,2010-12-13 08:27:10,2010-12-13 20:28:25 
243497,2010-12-13 08:27:20,2010-12-13 14:12:14 

Attempt at a solution (answers part of the question)

This is my solution. I only computed the first 3 columns, I get Start_count and End_count with a float64 dtype, and I index the data by the first boundary of the period timestamps (different from what I asked for, but OK). Overall, I don't know whether this can be done in a simpler, shorter, more elegant way.

import numpy as np
import pandas as pd

# Loading and parsing
data = pd.read_csv('test.csv')
data.Start = pd.to_datetime(data.Start, format='%Y-%m-%d %H:%M:%S')
data.End = pd.to_datetime(data.End, format='%Y-%m-%d %H:%M:%S')


interval = 10  # minutes

# Count Start timestamps per interval
Start_count = pd.Series(1, index=data.Start)
Start_count = Start_count.resample(str(interval) + 't').count()

# End_count series doesn't have the same length as Start_count
End_count = pd.Series(1, index=data.End)
End_count = End_count.resample(str(interval) + 't').count()

# This is an ugly way of going around encountered issues and doing what I wanted:
# convert the spans to float seconds, average per interval, convert back
Span = pd.Series(np.float64((data.End - data.Start) / np.timedelta64(1, 's')), index=data.Start)
Span_mean = Span.resample(str(interval) + 't').mean()
Span_mean = pd.to_timedelta(Span_mean, unit='s')

# When merging all series in a dataframe it seems that alignment is properly done
new_dataframe = pd.DataFrame({'Start_count': Start_count, 'End_count': End_count, 'Span_avg': Span_mean})
new_dataframe.fillna(0, inplace=True)
new_dataframe.index.rename('Periods', inplace=True)

new_dataframe.head() # Shows: 

                     End_count  Span_avg  Start_count
Periods
2010-12-12 23:00:00        0.0  00:36:21          1.0
2010-12-12 23:10:00        0.0  00:00:00          0.0
2010-12-12 23:20:00        0.0  02:54:42          1.0
2010-12-12 23:30:00        1.0  07:43:20          1.0
2010-12-12 23:40:00        0.0  05:12:08          3.0
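
As a side note on the attempt above (my addition, not part of the original post): resample accepts closed and label parameters, so the bins can be made right-closed and indexed by their right boundary directly, matching the ]Period - 10 min, Period] convention from the question. A minimal sketch, assuming the same test.csv file:

import pandas as pd

data = pd.read_csv('test.csv', parse_dates=['Start', 'End'])

# closed='right' makes each bin ]Period - 10 min, Period];
# label='right' indexes the bin by its right boundary
Start_count = pd.Series(1, index=data.Start).resample(
    '10min', closed='right', label='right').count()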

Answer

This is a difficult problem, but here is a solution:

import pandas as pd

period = "10min"

df = pd.read_csv("test.csv", parse_dates=[1, 2])
span = df.End - df.Start

# Label each event by the left edge of the bin its timestamp falls into
start_period = df.Start.dt.floor(period)
end_period = df.End.dt.floor(period)

# Number of events starting / ending in each bin
start_count = start_period.value_counts(sort=False)
end_count = end_period.value_counts(sort=False)

# Mean span per bin, grouped by the bin of the Start timestamp
span_average = pd.to_timedelta(
    span.dt.total_seconds().groupby(start_period).mean().round(),
    unit="s").rename("Span_average")

def average_span(s):
    # Mean of consecutive differences telescopes to (max - min) / (n - 1)
    if len(s) > 1:
        return (s.max() - s.min()).total_seconds() / (len(s) - 1)
    else:
        return 0

start_inter_avg = pd.to_timedelta(
    df.Start.groupby(start_period).agg(average_span).round(),
    unit="s").rename("Start_inter_avg")

end_inter_avg = pd.to_timedelta(
    df.End.groupby(end_period).agg(average_span).round(),
    unit="s").rename("End_inter_avg")

# Align everything on the bin labels, insert empty bins, fill gaps with 0
res = pd.concat([start_count, end_count, span_average, start_inter_avg, end_inter_avg],
                axis=1).resample(period).asfreq().fillna(0)

Output:

                     Start  End  Span_average  Start_inter_avg  End_inter_avg
2010-12-12 23:00:00    1.0  0.0      00:36:21         00:00:00       00:00:00
2010-12-12 23:10:00    0.0  0.0      00:00:00         00:00:00       00:00:00
2010-12-12 23:20:00    1.0  0.0      02:54:42         00:00:00       00:00:00
2010-12-12 23:30:00    1.0  1.0      07:43:20         00:00:00       00:00:00
2010-12-12 23:40:00    3.0  0.0      05:12:08         00:00:32       00:00:00
2010-12-12 23:50:00    0.0  0.0      00:00:00         00:00:00       00:00:00
2010-12-13 00:00:00    1.0  0.0      07:18:21         00:00:00       00:00:00
2010-12-13 00:10:00    0.0  0.0      00:00:00         00:00:00       00:00:00
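
One caveat (my observation, not the answerer's): dt.floor labels each bin by its left edge, i.e. [Period, Period + 10 min), whereas the question asked for ]Period - 10 min, Period]. If that convention matters, a possible tweak, reusing the df and period variables from the answer above, would be to ceil instead:

# dt.ceil maps every timestamp in ]Period - 10 min, Period] to Period
# (timestamps already on a boundary stay on that boundary)
start_period = df.Start.dt.ceil(period)
end_period = df.End.dt.ceil(period)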

Thanks, great work! I only recently started using pandas and I need to improve my understanding of some of its features. I really like your solution. – PDRX