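This is my first post, so please be gentle. I have searched far and wide for a solution but have not found one yet. The problem I am trying to solve is the following: a time-dependent pandas shift on a MultiIndexed frame.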

I have a dataset containing 500,000+ samples, each with 6 features.
I have put this dataset into a MultiIndexed pandas DataFrame.

The first level of my DataFrame's index is a time-series index, the second level is the ID. It looks as follows:

Time       id 
2017-03-07 10:06:49.963241984 122.0 -7.024347 
           136.0 -11.664985 
           243.0  1.716150 
2017-03-07 10:06:50.003462400 122.0 -7.025922 
           136.0 -11.671526

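At every timestamp a number of objects are visible, each labelled with the tag 'id'. For my application, I want to add a time dependency by including information about what happened 5 seconds earlier, i.e. at timestamp 10:06:45 in this example. Importantly, however, I only want to add this information if the object already existed at that timestamp (so only if the IDs are equal).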

I would like to use the dataframe.shift function, as mentioned here, and I would like to apply it per index level, as in How do you shift Pandas DataFrame with a multiindex?

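My question, as phrased by user Unutbu, is: how do I append extra columns to the original DataFrame X that give information about where these objects were 5 seconds ago? I would expect something like the following: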

X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)

Here rate is 25 Hz, i.e. we shift by 125 "DateTimeIndices", but only if an object with id = '...' exists at that timestamp.
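To illustrate what a per-id shift does, here is a minimal sketch on a hypothetical three-step frame (the data and the shift of 1 row are made up for the example; the question would use 5 * rate = 125). Note that `shift` counts rows within each id, not elapsed time, so it only corresponds to "5 seconds ago" if every object is present at every time step and the sampling is regular.

```python
import pandas as pd

# Hypothetical toy data: three time steps, object 2 is absent at the first one
times = pd.to_datetime(['2017-03-07 10:00:00.00',
                        '2017-03-07 10:00:00.04',
                        '2017-03-07 10:00:00.08'])
index = pd.MultiIndex.from_tuples(
    [(times[0], 1), (times[1], 1), (times[1], 2), (times[2], 1), (times[2], 2)],
    names=['Time', 'id'])
X = pd.DataFrame({'x_location': [10.0, 11.0, 20.0, 12.0, 21.0]}, index=index)

# Shift by one row inside each id group; the first row of every id has no predecessor
X['x_location_shifted'] = X.groupby(level='id')['x_location'].shift(1)
print(X)
```

If an object disappears for a few frames and comes back, the n-th previous row of that id is older than n time steps, which is why a purely row-based shift may not be enough here.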

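Edit: the timestamps are not 100% synchronised, so the interval between them is not always exactly 0.04 s. Previously, I used np.argmin(np.abs(time - index)) to find the index closest to a given timestamp.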

For example, in my setup, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.

id = 175 
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323

At timestamp 2017-03-07 10:36:08.604962560, the object with id = 175 has location_x = 67.165955.

id = 175 
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640') 
new_time = old_time + pd.Timedelta('5 seconds') 

# Finding the new value of location 
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]

So, in the end, at time step 10:36:08 I want to add the information from timestamp 10:36:03, provided the object already existed at that timestamp.
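One way to express this "nearest timestamp about 5 seconds earlier, same id" lookup without an explicit loop is `pd.merge_asof`. This is a sketch, not something proposed in the thread: the toy rows reuse the values quoted above, the extra ids 180 and 190 and the 20 ms tolerance are assumptions, and the frame is in long format (for the MultiIndexed X that would be `X.reset_index()`).

```python
import pandas as pd

# Hypothetical long-format data: one row per (timestamp, id)
df = pd.DataFrame({
    'Time': pd.to_datetime(['2017-03-07 10:36:03.605', '2017-03-07 10:36:03.605',
                            '2017-03-07 10:36:08.604', '2017-03-07 10:36:08.604']),
    'id': [175, 180, 175, 190],
    'location_x': [54.323, 1.0, 67.165955, 2.0],
})

# Copy of the data with timestamps moved 5 s forward, so a past row lines up with "now"
past = df.rename(columns={'location_x': 'location_x_shifted'})
past['Time'] = past['Time'] + pd.Timedelta(seconds=5)

# Nearest match per id, accepted only within the assumed 20 ms tolerance
result = pd.merge_asof(df.sort_values('Time'), past.sort_values('Time'),
                       on='Time', by='id', direction='nearest',
                       tolerance=pd.Timedelta(milliseconds=20))
print(result)
```

Rows whose id has no counterpart within the tolerance keep NaN in `location_x_shifted`, which matches the "only if the object already existed" requirement.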

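Edit 2: After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.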

for current_time in X.index.get_level_values(0)[125:]:
    # only do this if there are objects at the current time
    if len(X.loc[current_time].index):
        # Calculate the past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # Find the position in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time - X.index.get_level_values(0)))
        # Translate the position back to a timestamp label
        past_time = X.index[past_time_index][0]
        # In that time step, cycle through the objects
        for obj_id in X.loc[current_time].index:
            # Try looking up the values of object obj_id 5 s ago
            try:
                X.loc[(current_time, obj_id), 'box_center.x.shifted'] = X.loc[(past_time, obj_id), 'box_center.x']
                X.loc[(current_time, obj_id), 'box_center.y.shifted'] = X.loc[(past_time, obj_id), 'box_center.y']
                X.loc[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.loc[(past_time, obj_id), 'relative_velocity.x']
                X.loc[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.loc[(past_time, obj_id), 'relative_velocity.y']
            # If the key doesn't exist, the object didn't exist, ergo the field should be np.nan
            except KeyError:
                X.loc[(current_time, obj_id), 'box_center.x.shifted'] = np.nan
        print('Timestep {}'.format(current_time))

If any information is missing, please let me know and I will add it :)

Cheers and thanks!

Source

2017-05-08 Floris Remmen

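How exact does the 5 s difference have to be? In the sample data shown here there is only 0.04 s between rows. Could you provide some sample data where the 5 s gap applies?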

I have added the requested information.

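Assuming you have no gaps in your timestamps, one possible solution is the following, which creates a new index with shifted timestamps and uses it to fetch the value from 5 seconds earlier for each ID.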

offset = 5 * rate 
# Create a shallow copy of the multiindex levels for modification 
modified_levels = list(X.index.levels) 
# Shift them 
modified_times = pd.Series(modified_levels[0]).shift(offset) 
# Fill NaNs with dummy values to avoid duplicates in the new index 
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull())) 
modified_levels[0] = modified_times 
new_index = X.index.set_levels(modified_levels)
X['x_location_shifted'] = X.loc[new_index, 'x_location'].values

Source

2017-05-08 10:49:26

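I get what you are trying to do. It does not matter whether the data is added at exactly the 5-second mark; it could just as well be 4.995 or 5.005. However, I get the following error w.r.t. my index: "*** Exception: cannot handle a non-unique multi-index!"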

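This actually makes sense: because we are shifting by 5 seconds, the last 5 seconds of the DataFrame turn into not-a-time (NaT) values. We then try to locate rows based on a NaT value that occurs multiple times.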

Indeed, that seems to be the case. I have added a hack that replaces the NaNs with unique dummy values.
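The non-unique index discussed in these comments can be reproduced on a small series. This is a sketch with made-up timestamps; which end of the series turns into NaT depends on the sign of the shift.

```python
import pandas as pd

# Hypothetical regular time axis at 25 Hz (40 ms steps)
times = pd.Series(pd.date_range('2017-03-07 10:00:00', periods=4, freq='40ms'))

# Shifting by 2 positions leaves 2 entries without a value (NaT)
shifted = times.shift(2)
print(shifted)

# NaT now appears more than once, so an index built from these values is not unique
print(pd.Index(shifted).is_unique)
```

Any label-based lookup on an index built from `shifted` then runs into the repeated NaT entries, which is what the dummy-value workaround avoids.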

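If the timestamps are not 100% regular, you will either have to round to the nearest 1/x second or use a loop.

You can use something like this as a loop: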

Data definition

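import pandas as pd
import numpy as np
from io import StringIO

df_str = """
timestamp id location
10:00:00.005 1 a
10:00:00.005 2 b
10:00:00.005 3 c
10:00:05.006 2 a
10:00:05.006 3 b
10:00:05.006 4 c"""
# Columns are whitespace-separated; strip the leading blank line before parsing
df = pd.read_csv(StringIO(df_str.strip()), sep=r'\s+')
# The sample only has times of day, so prepend the date that appears in the result
df['timestamp'] = pd.to_datetime('2017-05-09 ' + df['timestamp'])

delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1/50, unit='s')

# object dtype, so that string locations can be written into the column later
df['location_shifted'] = pd.Series(np.nan, index=df.index, dtype=object)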

Iterating over the different IDs

for label_id in set(df['id']):
    # copy to make sure we don't overwrite the original data. Might not be necessary
    df_id = df[df['id'] == label_id].copy()
    df_id['time_shift'] = df_id['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        # distance between every row of this id and "this row + 5 s"
        time_dif = abs(df_id['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                # several candidates within the margin: take the closest one
                idx_shift = time_dif[time_dif < margin].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']

Result

                timestamp  id location location_shifted
0 2017-05-09 10:00:00.005   1        a
1 2017-05-09 10:00:00.005   2        b
2 2017-05-09 10:00:00.005   3        c
3 2017-05-09 10:00:05.006   2        a                b
4 2017-05-09 10:00:05.006   3        b                c
5 2017-05-09 10:00:05.006   4        c
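The rounding alternative mentioned at the top of this answer could look roughly like the sketch below. It assumes a 40 ms grid (25 Hz, as in the question) and that the jitter is small compared to half a grid step; the helper column `t_round` is a name made up for the example.

```python
import pandas as pd

# Same sample data as in the data definition above
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2017-05-09 10:00:00.005'] * 3 +
                                ['2017-05-09 10:00:05.006'] * 3),
    'id': [1, 2, 3, 2, 3, 4],
    'location': list('abcabc'),
})

# Snap every timestamp to the nearest point of the assumed 40 ms grid
df['t_round'] = df['timestamp'].dt.round('40ms')

# Copy of the data moved 5 s forward, so that past rows share a key with current rows
past = df[['t_round', 'id', 'location']].copy()
past['t_round'] = past['t_round'] + pd.Timedelta(seconds=5)
past = past.rename(columns={'location': 'location_shifted'})

# A left join keeps every current row; objects without a past row get NaN
result = df.merge(past, on=['t_round', 'id'], how='left')
print(result)
```

Two timestamps that sit close to a rounding boundary can still land on different grid points, so this only works if the jitter really is small relative to the grid step.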

Source

2017-05-09 09:46:32

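Hi Maarten. I tried your solution, since it looked promising. However, after more than 12 hours of processing the script still had not finished. I will most likely change the script to work in the simplest possible way: loop over all timestamps, look up the objects, check whether each object existed in the past and, if so, copy the data. I have added the code to my post.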

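For anyone arriving here with the same problem: I managed to solve it in a (minimally) vectorised way; however, it required me to go back to a 3D Panel.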

Three steps:
- convert the frame into a 3D Panel
- add the new columns
- fill those columns

A MultiIndexed 2D frame can be converted into a pandas.Panel, with the second index level becoming one of the Panel's axes.

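After this, I have a 3D Panel with axes [time, objects, parameters]. I then transpose the Panel so that the PARAMETERS become the items, which makes it possible to add columns to the data Panel. So: transpose the Panel, add the columns, transpose back.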

dp_new = dp.transpose(2,0,1) 
dp_new['shifted_box_center_x']=np.nan 
dp_new['shifted_box_center_y']=np.nan 
dp_new['shifted_relative_velocity_x']=np.nan 
dp_new['shifted_relative_velocity_y']=np.nan 
# transpose them back to their original form
dp_new = dp_new.transpose(1,2,0)

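Now that the new fields have been added, we can get their names via

new_fields = dp_new.minor_axis[-4:]

The goal is to add the information from 5 seconds earlier, if the object existed at that time. We therefore loop over the time series starting 5 seconds in. In my case, at a rate of 25 Hz, that is element 5 * rate = 125.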

Let us first set the time axis to start 5 seconds into the data Panel:

time = dp_new.items[125:]

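Then we iterate over an enumerated version of time. The enumeration starts at 0, which is the position in the data Panel of time step 0. The first time step in the loop, however, is the one at time 0 + 5 seconds.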

time = dp_new.items[125:] 
for iloc, ts in enumerate(time): 
    # Print progress 
    print('{} out of {}'.format(ts, dp.items[-1]) , end="\r", flush=True) 

    # Generate new INDEX field, by taking the field ID and dropping the NaN values 
    ids = dp_new.loc[ts].id.dropna().values 
    # Drop the nan field from the frame 
    dp_new[ts].dropna(thresh=5, inplace=True) 
    # save the original indices 
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values} 
    # set the index to field id 
    dp_new[ts].set_index(['id'], inplace=True) 

    # Check if the vector ids does NOT contain ALL ZEROS 
    if np.any(ids): # Check for all zeros 

        df_past = dp_new.iloc[iloc].copy()  # SCREENSHOT AT TS=5s --> ILOC = 0
        df_past.dropna(thresh=5, inplace=True)  # drop the nan rows
        df_past.set_index(['id'], inplace=True)  # set the index to field ID

        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values

This only fills the fields for which id == ids.

This code ran in about 5 minutes on a file with 300,000 elements.

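Note: this took me quite a while, mainly because of how a Panel is indexed. At first I thought that addressing all three dimensions at once would work, as the pandas documentation suggests, but that does not seem to be the case: dp_new[ts, ids, new_fields] = values does not work.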
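Since Panel has been removed from later pandas releases, a similar "time by object" layout can be sketched with a wide 2D frame instead. This is not the Panel code from this answer, only an illustration of the same idea under assumptions: the toy values, the extra ids 180 and 190 and the 20 ms tolerance are made up, and only one column is handled.

```python
import pandas as pd

# Hypothetical MultiIndexed frame: (Time, id) -> box_center.x
times = pd.to_datetime(['2017-03-07 10:36:03.605', '2017-03-07 10:36:08.604'])
index = pd.MultiIndex.from_tuples(
    [(times[0], 175), (times[0], 180), (times[1], 175), (times[1], 190)],
    names=['Time', 'id'])
X = pd.DataFrame({'box_center.x': [54.323, 1.0, 67.165955, 2.0]}, index=index)

# Wide layout: one row per timestamp, one column per id (NaN where an object is absent)
wide = X['box_center.x'].unstack('id')

# For every timestamp, fetch the row closest to "5 s earlier" within the assumed tolerance
past = wide.reindex(wide.index - pd.Timedelta(seconds=5),
                    method='nearest', tolerance=pd.Timedelta(milliseconds=20))
past.index = wide.index  # relabel the looked-up rows with the current timestamps

# Back to the long layout as a Series indexed by (id, Time), then align on (Time, id)
X['box_center.x.shifted'] = past.unstack().swaplevel().reindex(X.index)
print(X)
```

The lookup happens once per timestamp rather than once per object, which is the same saving the Panel version aims for; the other three columns would need the same treatment.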

Source

2017-05-15 14:35:13
