2017-08-07 87 views
1

我有一個數據幀熊貓其由以下各列的濱10分鐘樣品中

col1, col2, _time 

_time列是當該行中的時間發生的日期時間對象。

我想在兩列10分鐘的時間段內對我的數據幀進行重新採樣,併合計每10分鐘發生一次的組的行數。我想產生的數據框有以下幾列

col1 col2 since until count 

哪裏since是每10分鐘週期until的時間每10分鐘週期的結束的開頭和計數是哪裏的初始數據幀像發現的行數

col1 col2   since     until   count 
1  1  08/12/2017 12:00  08/12/2017 12:10  10 
1  2  08/12/2017 12:00  08/12/2017 12:10  5 
1  1  08/12/2017 12:10  08/12/2017 12:20  3 

這是可能的數據幀的resample方法?

+0

你能提供初始樣本數據? – Wen

回答

1

我也一直以來一直在尋找resample爲此,無濟於事。幸運的是,我找到了一個使用pd.Series.dt.floor的解決方案!

  1. 使用.dt.floor對齊您的時間戳至10分鐘間隔,
  2. 使用產生的對象中一個GROUPBY(或任選地,它在源數據分配給一列,並使用柱)
  3. 使用pd.to_timedelta從您since

例如計算until列,

import pandas as pd 

interval = '10min' # 10 minutes intervals, please 

# Dummy data with 3-minute intervals 
data = pd.DataFrame({ 
    'col1': [0, 0, 1, 0, 0, 0, 1, 0, 1, 1], 
    'col2': [4, 4, 4, 3, 4, 4, 3, 3, 4, 4], 
    '_time': pd.date_range(start='2010-01-01 00:01:00', freq='3min', periods=10), 
}) 

# Floor the timestamps to your desired interval 
since = data['_time'].dt.floor(interval).rename('since') 

# Get the size of each group - groups are in the index of `agg` 
agg = data.groupby(['col1', 'col2', since]).size() 
agg = agg.rename('count') 

# Back to dataframe 
agg = agg.reset_index() 

# Simply add your interval to `since` 
agg['until'] = agg['since'] + pd.to_timedelta(interval) 

print(agg) 

    col1 col2    since count    until 
0  0  3 2010-01-01 00:10:00  1 2010-01-01 00:20:00 
1  0  3 2010-01-01 00:20:00  1 2010-01-01 00:30:00 
2  0  4 2010-01-01 00:00:00  2 2010-01-01 00:10:00 
3  0  4 2010-01-01 00:10:00  2 2010-01-01 00:20:00 
4  1  3 2010-01-01 00:10:00  1 2010-01-01 00:20:00 
5  1  4 2010-01-01 00:00:00  1 2010-01-01 00:10:00 
6  1  4 2010-01-01 00:20:00  2 2010-01-01 00:30:00 
0

如果您仍在尋找答案,本示例可能會以某種方式幫助您。

import pandas as pd 
import numpy as np 
import datetime 

# create some random data 
df = pd.DataFrame(columns=["col1","col2","timestamp"]) 
df.col1 = np.random.randint(100, size = 10) 
df.col2 = np.random.randint(100, size = 10) 
df.timestamp = [datetime.datetime(2000,1,1) + \ 
      datetime.timedelta(hours=int(i)) for i in np.random.randint(100, size = 10)] 

# sort data by timestamp and reset index 
df = df.sort_values(by="timestamp").reset_index(drop=True) 

# create the bins by taking last first time and last time with freq 6h 
bins = pd.date_range(start=df.timestamp.values[0],end=df.timestamp.values[-1], freq="6h") # change to reasonable freq (d, h, m, s) 
# zip them to pairs 
startend = list(zip(bins, bins.shift(1))) 

# define a function that finds bin index 
def time_in_range(x): 
    """Return true if x is in the range [start, end]""" 
    for ind,(start,end) in enumerate(startend): 
     if start <= x <= end: 
      return ind 


# Add bin index to column named index 
df['index'] = df.timestamp.apply(time_in_range) 
# groupby index to find sum and count 
df = df.groupby('index')["col1","col2"].agg(['sum','count']).reset_index() 


# Create output df2 (with bins)   
df2 = pd.DataFrame(startend, columns=["start","end"]).reset_index() 

# Join the two dataframes with column index 
df3 =pd.merge(df2, df, how='outer', on='index').fillna(0) 

# Final adjustments 
df3.columns = ["index","start","end","col1","delete","col2","count"] 
df3.drop(['delete','index'], axis=1, inplace=True) 

輸出:

<table border="1" class="dataframe"> <thead> <tr style="text-align: right;">  <th></th>  <th>start</th>  <th>end</th>  <th>col1</th>  <th>col2</th>  <th>count</th> </tr> </thead> <tbody> <tr>  <th>0</th>  <td>2000-01-01 21:00:00</td>  <td>2000-01-02 03:00:00</td>  <td>89.0</td>  <td>136.0</td>  <td>2.0</td> </tr> <tr>  <th>1</th>  <td>2000-01-02 03:00:00</td>  <td>2000-01-02 09:00:00</td>  <td>0.0</td>  <td>0.0</td>  <td>0.0</td> </tr> <tr>  <th>2</th>  <td>2000-01-02 09:00:00</td>  <td>2000-01-02 15:00:00</td>  <td>69.0</td>  <td>27.0</td>  <td>1.0</td> </tr> <tr>  <th>3</th>  <td>2000-01-02 15:00:00</td>  <td>2000-01-02 21:00:00</td>  <td>0.0</td>  <td>0.0</td>  <td>0.0</td> </tr> <tr>  <th>4</th>  <td>2000-01-02 21:00:00</td>  <td>2000-01-03 03:00:00</td>  <td>0.0</td>  <td>0.0</td>  <td>0.0</td> </tr> <tr>  <th>5</th>  <td>2000-01-03 03:00:00</td>  <td>2000-01-03 09:00:00</td>  <td>0.0</td>  <td>0.0</td>  <td>0.0</td> </tr> <tr>  <th>6</th>  <td>2000-01-03 09:00:00</td>  <td>2000-01-03 15:00:00</td>  <td>108.0</td>  <td>57.0</td>  <td>2.0</td> </tr> <tr>  <th>7</th>  <td>2000-01-03 15:00:00</td>  <td>2000-01-03 21:00:00</td>  <td>35.0</td>  <td>85.0</td>  <td>2.0</td> </tr> <tr>  <th>8</th>  <td>2000-01-03 21:00:00</td>  <td>2000-01-04 03:00:00</td>  <td>102.0</td>  <td>92.0</td>  <td>2.0</td> </tr> <tr>  <th>9</th>  <td>2000-01-04 03:00:00</td>  <td>2000-01-04 09:00:00</td>  <td>0.0</td>  <td>0.0</td>  <td>0.0</td> </tr> <tr>  <th>10</th>  <td>2000-01-04 09:00:00</td>  <td>2000-01-04 15:00:00</td>  <td>0.0</td>  <td>0.0</td>  <td>0.0</td> </tr> <tr>  <th>11</th>  <td>2000-01-04 15:00:00</td>  <td>2000-01-04 21:00:00</td>  <td>91.0</td>  <td>3.0</td>  <td>1.0</td> </tr> </tbody></table>