2017-02-28

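Pandas DataFrame calculation between rows

I am trying to read a log and calculate the duration of a particular workflow. The DataFrame containing the log looks like this: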

Timestamp Workflow Status 
20:31:52  ABC   Started 
... 
... 
20:32:50  ABC   Completed 

To calculate the execution time, I use the following code:

start_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Started')]['Timestamp']
compl_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Completed')]['Timestamp']
duration = compl_time - start_time 

And the answer I get is:

1 NaT 
72 NaT 
Name: Timestamp, dtype: timedelta64[ns] 

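I think the time difference is not calculated correctly because the indexes differ. Of course, I can get the right answer by explicitly using the index label of each row: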

duration = compl_time.loc[72] - start_time[1]

However, this seems like an inelegant way of doing it. Is there a better way to achieve the same goal?

Answer

You are right, the problem is the different indexes: the two Series cannot be aligned, so every entry in the result is NaT.
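
To see the alignment effect in isolation, here is a minimal sketch. The two-row `log_text` below is a hypothetical stand-in for the real log (it reuses the index labels 1 and 72 from the question's output), and the timestamps are parsed with an assumed `%H:%M:%S` format.

```python
import pandas as pd

# Hypothetical two-row log mirroring the question; index labels 1 and 72
# are taken from the question's printed output.
log_text = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(["20:31:52", "20:32:50"], format="%H:%M:%S"),
        "Workflow": ["ABC", "ABC"],
        "Status": ["Started", "Completed"],
    },
    index=[1, 72],
)

start_time = log_text.loc[log_text["Status"] == "Started", "Timestamp"]
compl_time = log_text.loc[log_text["Status"] == "Completed", "Timestamp"]

# Series arithmetic aligns on index labels. The result carries both
# labels (1 and 72), but neither label exists in both operands,
# so both entries come out as NaT.
misaligned = compl_time - start_time
print(misaligned)
```

Both operands hold exactly one value, yet the result has two rows, one per label, which is why the subtraction in the question looked like it "failed".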

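The simplest fix is to convert one side to a numpy array with values, which drops its index. Both Series must then have the same length (here both have length == 1). Also, when selecting with boolean indexing it is better to use loc: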

mask = log_text['Workflow']=='ABC' 
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp'] 
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'), 'Timestamp']

print (len(start_time)) 
1 
print (len(compl_time)) 
1 

duration = compl_time - start_time.values 

print (duration) 
1 00:00:58 
Name: Timestamp, dtype: timedelta64[ns] 

duration = compl_time.values - start_time.values 

print (pd.to_timedelta(duration)) 
TimedeltaIndex(['00:00:58'], dtype='timedelta64[ns]', freq=None) 

print (pd.Series(pd.to_timedelta(duration))) 
0 00:00:58 
dtype: timedelta64[ns]
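
For reference, the whole approach can be put together as one self-contained sketch. The sample `log_text` is again hypothetical (two rows with index labels 1 and 72), the explicit length check is an added safeguard rather than part of the answer above, and `reset_index(drop=True)` is shown as an alternative way to make the labels match instead of dropping them.

```python
import pandas as pd

# Hypothetical sample data; the real log would have many more rows.
log_text = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(["20:31:52", "20:32:50"], format="%H:%M:%S"),
        "Workflow": ["ABC", "ABC"],
        "Status": ["Started", "Completed"],
    },
    index=[1, 72],
)

mask = log_text["Workflow"] == "ABC"
start_time = log_text.loc[mask & (log_text["Status"] == "Started"), "Timestamp"]
compl_time = log_text.loc[mask & (log_text["Status"] == "Completed"), "Timestamp"]

# Positional subtraction only makes sense if both selections have the
# same number of rows, so fail early otherwise.
if len(start_time) != len(compl_time):
    raise ValueError("Unequal number of Started and Completed rows")

# Option 1: strip the index from one side; the result keeps the
# index of compl_time (label 72 here).
duration = compl_time - start_time.to_numpy()

# Option 2: give both sides the same fresh 0..n-1 index so that
# label alignment pairs the rows by position.
duration_reset = compl_time.reset_index(drop=True) - start_time.reset_index(drop=True)

print(duration)
print(duration_reset)
```

Option 1 is convenient when the completion row's label should be preserved; option 2 is arguably clearer when the original labels carry no meaning after filtering.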