2017-04-26
2

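Python & Pandas: counting time data in 2-hour increments

I'm trying to group a bunch of time-series data into 2-hour buckets. I'm very new to this, so please bear with me. Based on earlier research, I think I can use pandas.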

I have a dataset (the values stored in mytime) that looks like this:

['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28',  
'06:49', '08:39', '09:15', '08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05', '14:24', '15:30', 
'13:28', '16:55', '09:29', '17:44', '07:24', '09:37', '06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45', 
'23:49', '19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52', '02:50', '12:51', '17:33', '07:07', 
'08:11', '10:06', '23:48', '22:27', '11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40', '17:26', 
'18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34'] 

I'd like to take this dataset and basically see output like this:

0-2: 4 
2-4: 7 
4-6: 3 
6-8: 3 
8-10: 2 
10-12: 5 
12-14: 14 
....etc 

Here is a subset of my code:

import csv 
import datetime 
from collections import Counter 
import pandas as pd 
import numpy as np 

mycount = Counter() 
mytime = [] 
with open('temp_dates.csv') as csvfile2: 
    readCSV2 = csv.reader(csvfile2, delimiter=',') 
    incoming = [] 
    for row in readCSV2: 
        readin = row[0] 
        time = row[1] 
        year, month, day = (int(x) for x in readin.split('-')) 
        ans = datetime.date(year, month, day) 
        wkday = ans.strftime("%A") 
        incoming.append([wkday,time]) 
        mycount[wkday] += 1 
        mytime.append(time) 
    with open('new_dates2.csv', 'w') as out_file: 
        writer = csv.writer(out_file) 
        writer.writerows(incoming) 
csvfile2.close() 

for key,value in sorted(mycount.items()): 
    daylist = key, value 
    print(daylist) 

#print(mytime) 
df = pd.DataFrame() 
#print(df) 
df.groupby([df['mytime'],pd.TimeGrouper(freq='2H')]) 

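I'm guessing my first problem is that the data isn't formatted in a way TimeGrouper can understand? Second, I'm probably missing something that tells the DataFrame what to look at? Any help would be appreciated.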

As requested, here is a snippet of the original source CSV file (we're only talking about column 2, which is what populates 'mytime').

Sunday,14:35 
Sunday,10:55 
Friday,22:29 
Friday,06:24 
Thursday,09:25 
Wednesday,06:45 
+0

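This is a bit confusing. Your first statement is that you have a list of times, but the first piece of code builds dates from a CSV. I'm guessing the list mytime contains the data, and only the last two lines are the actual problem? – Ben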

+0

Please provide a reproducible sample dataset in its original (CSV) format – MaxU

+0

mytime is what I'm trying to pull the data from - it is populated from the CSV file (row[1]). The list of data above comes directly from mytime – Justin

Answers

1

UPDATE:

In [96]: mytime = ['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28', 
    ...: 
    ...: '06:49', '08:39', '09:15', '08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05', '14:24', '15:30', 
    ...: '13:28', '16:55', '09:29', '17:44', '07:24', '09:37', '06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45', 
    ...: '23:49', '19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52', '02:50', '12:51', '17:33', '07:07', 
    ...: '08:11', '10:06', '23:48', '22:27', '11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40', '17:26', 
    ...: '18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34'] 

In [97]: s = pd.to_datetime(mytime).to_series() 

In [98]: s 
Out[98]: 
2017-04-26 15:23:00 2017-04-26 15:23:00 
2017-04-26 14:41:00 2017-04-26 14:41:00 
2017-04-26 13:54:00 2017-04-26 13:54:00 
2017-04-26 07:13:00 2017-04-26 07:13:00 
2017-04-26 20:21:00 2017-04-26 20:21:00 
2017-04-26 13:15:00 2017-04-26 13:15:00 
2017-04-26 14:48:00 2017-04-26 14:48:00 
2017-04-26 12:06:00 2017-04-26 12:06:00 
2017-04-26 08:37:00 2017-04-26 08:37:00 
2017-04-26 06:32:00 2017-04-26 06:32:00 
           ... 
2017-04-26 09:19:00 2017-04-26 09:19:00 
2017-04-26 09:10:00 2017-04-26 09:10:00 
2017-04-26 09:17:00 2017-04-26 09:17:00 
2017-04-26 09:26:00 2017-04-26 09:26:00 
2017-04-26 14:18:00 2017-04-26 14:18:00 
2017-04-26 06:31:00 2017-04-26 06:31:00 
2017-04-26 14:13:00 2017-04-26 14:13:00 
2017-04-26 14:01:00 2017-04-26 14:01:00 
2017-04-26 08:57:00 2017-04-26 08:57:00 
2017-04-26 21:34:00 2017-04-26 21:34:00 
dtype: datetime64[ns] 

In [106]: s.groupby(pd.cut(s.dt.hour, 
    ...:     bins=np.arange(26, step=2), 
    ...:     right=False, 
    ...:     include_lowest=True)) \ 
    ...: .size() 
    ...: 
Out[106]: 
[0, 2)       1 
[2, 4)       1 
[4, 6)       2 
[6, 8)      12 
[8, 10)     17 
[10, 12)     5 
[12, 14)     7 
[14, 16)    15 
[16, 18)     7 
[18, 20)     2 
[20, 22)     4 
[22, 24)     4 
dtype: int64 
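
The interval labels in this output can be turned into the `0-2`, `2-4` style the question asks for by passing `labels` to `pd.cut`. What follows is only a minimal sketch, assuming Python 3.6+ (for f-strings) and using six sample times from the question's CSV excerpt in place of the full `mytime` list. The names `hours`, `edges`, `labels` and `buckets` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Six sample values from the question's CSV excerpt; substitute the full mytime list.
mytime = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

# Parse with an explicit format so 'HH:MM' is not left to inference.
hours = pd.Series(pd.to_datetime(mytime, format='%H:%M').hour)

edges = np.arange(0, 26, 2)                            # 0, 2, ..., 24
labels = [f'{lo}-{lo + 2}' for lo in range(0, 24, 2)]  # '0-2', '2-4', ...

# right=False makes each bin [lo, lo + 2), so 06:xx lands in '6-8'.
buckets = pd.cut(hours, bins=edges, right=False, labels=labels)

# sort=False keeps the rows in bin order instead of ordering by frequency.
counts = buckets.value_counts(sort=False)
print(counts)
```

Because the result is categorical, empty buckets are reported with a count of 0 rather than being dropped.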

df = pd.read_csv('/path/to/file.csv', parse_dates=[1], names=['date','time']) 

In [55]: df 
Out[55]: 
        date                time 
0     Sunday 2017-04-26 14:35:00 
1     Sunday 2017-04-26 10:55:00 
2     Friday 2017-04-26 22:29:00 
3     Friday 2017-04-26 06:24:00 
4   Thursday 2017-04-26 09:25:00 
5  Wednesday 2017-04-26 06:45:00 

In [59]: df.groupby(pd.cut(df.time.dt.hour, bins=np.arange(26, step=2), include_lowest=True)).size() 
Out[59]: 
time 
[0, 2]      0 
(2, 4]      0 
(4, 6]      2 
(6, 8]      0 
(8, 10]     2 
(10, 12]    0 
(12, 14]    1 
(14, 16]    0 
(16, 18]    0 
(18, 20]    0 
(20, 22]    1 
(22, 24]    0 
dtype: int64 
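
The two `groupby` calls in this answer differ in the `right` argument of `pd.cut`, and that changes where the even hours fall. With the default `right=True` the bins are closed on the right, so 06:24 and 06:45 are counted in `(4, 6]`. With `right=False` they are counted in `[6, 8)`, which matches the `6-8` bucket the question describes. A small sketch of the difference, using the hours of the six sample CSV rows (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

hours = pd.Series([14, 10, 22, 6, 9, 6])   # hours of the six sample CSV rows
edges = np.arange(26, step=2)              # 0, 2, ..., 24

closed_right = pd.cut(hours, bins=edges, include_lowest=True)  # bins like (4, 6]
closed_left = pd.cut(hours, bins=edges, right=False)           # bins like [6, 8)

# Show, row by row, which bin each hour is assigned to under each setting.
comparison = pd.DataFrame({
    'hour': hours,
    'right=True': closed_right.astype(str),
    'right=False': closed_left.astype(str),
})
print(comparison)
```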
+0

Thanks Max. So in my case, I'd like to work directly from the mytime object. Could I then set the DataFrame to something like df = pd.DataFrame(mytime)? – Justin

+0

@Justin, sure, you can do that (see the update). But using Pandas to parse your data is much easier and much more efficient... ;-) – MaxU

+0

Cool - that modified groupby did the trick! Thanks...! – Justin
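
On the question's original attempt: `pd.TimeGrouper` groups on datetime values (normally a `DatetimeIndex`), so it cannot work on an empty `DataFrame` or on plain `'HH:MM'` strings. It was also replaced by `pd.Grouper` in later pandas releases. The sketch below shows one way that route might look, assuming a reasonably recent pandas. The strings are parsed first (the date defaults to 1900-01-01 when only a time is given), and `pd.offsets.Hour(2)` is used instead of a frequency string because the spelling of the hour alias has changed between versions. The names `stamps` and `events` are illustrative.

```python
import pandas as pd

# Six sample values from the question's CSV excerpt; substitute the full mytime list.
mytime = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

# Parse the strings into real timestamps; they all share the same default date.
stamps = pd.to_datetime(mytime, format='%H:%M')

# One row per event, indexed by its timestamp, so Grouper has a DatetimeIndex.
events = pd.Series(1, index=stamps)

# Two-hour buckets aligned to midnight; size() counts the rows in each bucket.
counts = events.groupby(pd.Grouper(freq=pd.offsets.Hour(2))).size()
print(counts)
```

Unlike the `pd.cut` version, this only covers the span between the earliest and latest bucket that actually occur in the data.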

0

Here's what I came up with. I'm still struggling with the sorting, as you'll see in the output:

data = ['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28',  
'06:49', '08:39', '09:15', '08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05', '14:24', '15:30', 
'13:28', '16:55', '09:29', '17:44', '07:24', '09:37', '06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45', 
'23:49', '19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52', '02:50', '12:51', '17:33', '07:07', 
'08:11', '10:06', '23:48', '22:27', '11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40', '17:26', 
'18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34'] 


import pandas as pd 

df = pd.DataFrame({'mytime': data}) 

df['mytime'] = pd.to_datetime(df['mytime']).dt.floor('2H').dt.time 
df['hour'] = df.mytime.apply(lambda x: str(x.hour) + '-' + str(x.hour +2)) 
df = df.groupby('hour').size() 
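
The sorting trouble comes from grouping on string labels: `'10-12'` sorts before `'2-4'` because strings are compared character by character. One possible way around it, sketched here under the assumption of Python 3.6+ and with six sample times standing in for the full list, is to group on the integer start hour and build the labels only after sorting. The names `times`, `start` and `counts` are illustrative.

```python
import pandas as pd

# Six sample values from the question's CSV excerpt; substitute the full list.
data = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

times = pd.to_datetime(pd.Series(data), format='%H:%M')

# Integer start hour of each 2-hour bucket: 0, 2, 4, ..., 22.
start = times.dt.hour // 2 * 2

# Counting on the integer key lets sort_index() order the buckets numerically.
counts = start.value_counts().sort_index()

# Build the 'lo-hi' labels only after the rows are in the right order.
counts.index = [f'{h}-{h + 2}' for h in counts.index]
print(counts)
```

Buckets with no entries do not appear in this result, since only values that occur are counted.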
0

Here is one approach using numpy's histogram function:

import numpy as np 
data = ['15:23', '14:41', '13:54', '07:13', '20:21', '13:15', '14:48', '12:06', '08:37', '06:32', '07:04', '14:20', '16:28','06:49', '08:39', '09:15','08:54', '05:37', '14:43', '06:20', '11:25', '11:05', '09:28', '14:05','14:24', '15:30', '13:28', '16:55', '09:29', '17:44', '07:24', '09:37','06:47', '14:35', '10:55', '22:29', '06:24', '09:25', '06:45', '23:49','19:34', '01:31', '14:22', '13:58', '09:08', '05:11', '08:09', '08:52','02:50', '12:51', '17:33', '07:07', '08:11', '10:06', '23:48', '22:27','11:15', '15:09', '16:45', '20:42', '12:12', '07:08', '16:13', '20:40','17:26', '18:57', '15:07', '09:19', '09:10', '09:17', '09:26', '14:18', '06:31', '14:13', '14:01', '08:57', '21:34'] 
time = [int(h) + int(m)/60 for h, m in (y.split(':') for y in data)] 
bins = list(range(0, 26, 2)) 
counts, bins = np.histogram(time, bins) 
dict(zip(bins, counts)) 

Result:

{0: 1, 
2: 1, 
4: 2, 
6: 12, 
8: 17, 
10: 5, 
12: 7, 
14: 15, 
16: 7, 
18: 2, 
20: 4, 
22: 4} 
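
The dictionary above is keyed by the left edge of each bin. `np.histogram` returns 13 edges for 12 counts, and `zip` silently drops the final edge (24). If labels such as `6-8` are preferred, the edges can be paired up explicitly. A minimal sketch, assuming Python 3.7+ (for ordered dicts and f-strings), working in whole minutes to avoid float division, and using six sample times rather than the full list (the names `minutes`, `edges` and `result` are illustrative):

```python
import numpy as np

# Six sample values from the question's CSV excerpt; substitute the full list.
data = ['14:35', '10:55', '22:29', '06:24', '09:25', '06:45']

# Minutes since midnight, kept as integers.
minutes = [int(h) * 60 + int(m) for h, m in (t.split(':') for t in data)]

# Bin edges every 120 minutes from 0 to 1440 inclusive (13 edges, 12 bins).
edges = np.arange(0, 24 * 60 + 1, 120)
counts, edges = np.histogram(minutes, bins=edges)

# Pair each bin's left and right edge to build an 'lo-hi' label in hours.
result = {
    f'{lo // 60}-{hi // 60}': int(n)
    for lo, hi, n in zip(edges[:-1], edges[1:], counts)
}
print(result)
```

Note that `np.histogram` treats every bin as half-open `[lo, hi)` except the last one, which also includes its right edge.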
+0

Thanks. So in my case, I can just set "data = mytime", since the values in mytime are the same as your manual input? – Justin

+0

Yes, that should work. –