2016-01-28 124 views
1

我需要比較兩個數據幀。一個記錄事件和其他故障。我必須標記是一個範圍failure.I內的事件給你舉個例子:按日期時間間隔比較兩個數據幀(python pandas)

DF1

failures initial end 
1 22/12/2015 3:31 22/12/2015 3:33 

我想要得到的(事件)

EventID arrivalTime 
3949362 22/12/2015 3:29 
3948289 22/12/2015 3:29 
3947252 22/12/2015 3:29 
3951196 22/12/2015 3:29 
3949908 22/12/2015 3:30 
3948820 22/12/2015 3:30 
3946194 22/12/2015 3:31 
3949364 22/12/2015 3:31 
3948292 22/12/2015 3:31 
3947774 22/12/2015 3:31 
3946736 22/12/2015 3:31 
3947254 22/12/2015 3:32 
3949366 22/12/2015 3:32 
3948294 22/12/2015 3:32 
3946196 22/12/2015 3:32 
3948824 22/12/2015 3:33 
3949909 22/12/2015 3:33 
3951200 22/12/2015 3:33 
3947255 22/12/2015 3:33 
3949368 22/12/2015 3:34 
3946198 22/12/2015 3:34 

DF2(失敗)以下結果:

EventID arrivalTime interval 
3949362 22/12/2015 3:29 0 
3948289 22/12/2015 3:29 0 
3947252 22/12/2015 3:29 0 
3951196 22/12/2015 3:29 0 
3949908 22/12/2015 3:30 0 
3948820 22/12/2015 3:30 0 
3946194 22/12/2015 3:31 1 
3949364 22/12/2015 3:31 1 
3948292 22/12/2015 3:31 1 
3947774 22/12/2015 3:31 1 
3946736 22/12/2015 3:31 1 
3947254 22/12/2015 3:32 1 
3949366 22/12/2015 3:32 1 
3948294 22/12/2015 3:32 1 
3946196 22/12/2015 3:32 1 
3948824 22/12/2015 3:33 1 
3949909 22/12/2015 3:33 1 
3951200 22/12/2015 3:33 1 
3947255 22/12/2015 3:33 0 
3949368 22/12/2015 3:34 0 
3946198 22/12/2015 3:34 0 

目前我做了兩個嵌套for,但我想更有效地做到這一點。

在此先感謝

+0

您的結果中是否存在拼寫錯誤?因爲這完全像你的原始事件表。你能確認嗎?你也可以請包括你試圖解決這個問題,你卡在哪裏? – Kartik

+0

正確的錯誤事件的時間間隔值3951200 = 1. – EduardoRL

+0

我目前使用的是一個函數。我手工輸入數據失敗。我正在學習熊貓和矢量化。我正在將我的代碼翻譯成熊貓。這是我的功能: – EduardoRL

回答

1

可以使用stack變革行柱,然後通過resample和最後numpy.whereisin添加缺少的initialend之間的數據:

df2 = df2.stack().reset_index(level=1) 
df2 = df2.drop('level_1', axis=1) 
df2.columns = ['Failuredate'] 
print df2 
      Failuredate 
0 2015-12-22 03:31:00 
0 2015-12-22 03:33:00 

#resample and fill missing data 
df2 = df2.groupby(df2.index).apply(lambda x: x.set_index('Failuredate'). 
         resample('1min', how='first',fill_method='ffill')).reset_index(level=1) 
print df2 
      Failuredate 
0 2015-12-22 03:31:00 
0 2015-12-22 03:32:00 
0 2015-12-22 03:33:00 
df1['interval'] = np.where(df1['arrivalTime'].isin(df2['Failuredate']), 1, 0) 
print df1 
    EventID   arrivalTime interval 
0 3949362 2015-12-22 03:29:00   0 
1 3948289 2015-12-22 03:29:00   0 
2 3947252 2015-12-22 03:29:00   0 
3 3951196 2015-12-22 03:29:00   0 
4 3949908 2015-12-22 03:30:00   0 
5 3948820 2015-12-22 03:30:00   0 
6 3946194 2015-12-22 03:31:00   1 
7 3949364 2015-12-22 03:31:00   1 
8 3948292 2015-12-22 03:31:00   1 
9 3947774 2015-12-22 03:31:00   1 
10 3946736 2015-12-22 03:31:00   1 
11 3947254 2015-12-22 03:32:00   1 
12 3949366 2015-12-22 03:32:00   1 
13 3948294 2015-12-22 03:32:00   1 
14 3946196 2015-12-22 03:32:00   1 
15 3948824 2015-12-22 03:33:00   1 
16 3949909 2015-12-22 03:33:00   1 
17 3951200 2015-12-22 03:33:00   1 
18 3947255 2015-12-22 03:33:00   1 
19 3949368 2015-12-22 03:34:00   0 
20 3946198 2015-12-22 03:34:00   0 
0

這我在矢量化之前使用的函數。

def periodofailure(df, failure): 

    l1 = [] 

    if failure == 'no': 
     l1 = ['no' for n in range(len(df))] 
     df1 = pd.DataFrame(l1) 
     frames = [df, df1] 
     df = pd.concat(frames, axis=1) 
     df.rename(columns={0:'Incidencia'}, inplace=True) 
     return df 

    elif failure == 'si': 
     # here I manually put failures 
     IFfailure1 = str('2015-12-02 06:30:00') 
     EFfailure1 = str('2015-12-02 06:42:00') 
     IFfailure2 = str('2015-12-10 18:53:00') 
     EFfailure2 = str('2015-12-10 19:05:00') 
     IFfailure3 = str('2015-12-11 21:18:00') 
     EFfailure3 = str('2015-12-12 00:09:00') 
     IFfailure4 = str('2015-12-15 11:45:00') 
     EFfailure4 = str('2015-12-15 12:17:00') 
     IFfailure5 = str('2015-11-18 12:28:00') 
     EFfailure5 = str('2015-11-18 12:59:00') 

     for index, row in df.iterrows(): 

      if IFfailure1 <= df['arrivalTime'][index] <= EFfailure1: 
       l1.append('si') 
      elif IFfailure2 <= df['arrivalTime'][index] <= EFfailure2: 
       l1.append('si') 
      elif IFfailure3 <= df['arrivalTime'][index] <= EFfailure3: 
       l1.append('si') 
      elif IFfailure4 <= df['arrivalTime'][index] <= EFfailure4: 
       l1.append('si') 
      elif IFfailure5 <= df['arrivalTime'][index] <= EFfailure5: 
       l1.append('si') 
      else: 
       l1.append('no') 

     df1 = pd.DataFrame(l1) 
     frames = [df, df1] 
     df = pd.concat(frames, axis=1) 
     df.rename(columns={0:'Incidencia'}, inplace=True) 
     return df