2016-07-27

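De-duplicating a datetime index by adding milliseconds

I want to use a datetime as the main index, but it contains many duplicates. What I want is to add artificial milliseconds within each group of identical seconds, so that they act as a "counter".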

For example, the original DataFrame looks like this:

      Bid BidVol 
2016-06-27 13:00:10 4183.50  0 
2016-06-27 13:00:10 4183.50  0 
2016-06-27 13:00:10 4183.50  0 
2016-06-28 13:00:10 4249.25  1 
2016-06-28 13:00:10 4249.25  1 
2016-06-28 13:00:10 4249.00  1 
2016-06-28 13:00:10 4248.75  1 
2016-06-28 13:00:10 4248.75  2 
2016-06-28 13:00:10 4248.75  1 
2016-06-28 13:00:10 4248.75  2 
2016-06-28 13:00:12 4248.50  0 
2016-06-28 13:00:12 4248.50  0 
2016-06-29 13:00:12 4353.75  0 
2016-06-29 13:00:12 4353.75  0 
2016-06-29 13:00:12 4353.75  0 
2016-06-29 13:00:12 4354.00  1 
2016-06-29 13:00:12 4354.00  1 
2016-06-29 13:00:12 4353.75  0 
2016-06-29 13:00:12 4354.00  1 
2016-06-29 13:00:12 4354.00  1 
2016-06-29 13:00:12 4354.00  1 
2016-06-29 13:00:12 4354.00  1 
2016-06-30 13:00:10 4394.00  0 
2016-06-30 13:00:11 4394.25  1 
2016-06-30 13:00:11 4394.00  0 

My goal is to change the duplicate rows to:

2016-06-28 13:00:10 
2016-06-28 13:00:10.001000 
2016-06-28 13:00:10.002000 
2016-06-28 13:00:10.003000 
2016-06-28 13:00:10.004000 
2016-06-28 13:00:10.005000 
2016-06-28 13:00:10.006000 

I tried playing with the groupby function, and I can use it to print the millisecond-shifted timestamps in a loop:

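import pandas as pd

# test is the DataFrame with the duplicated DatetimeIndex
for name, group in test.groupby(test.index):
    print('------')
    # enumerate supplies the per-group counter (0, 1, 2, ...)
    for i, (idx, values) in enumerate(group.iterrows()):
        print(idx + pd.Timedelta(milliseconds=i))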

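But I don't know the most efficient way to actually change the index to get the result I need. Efficiency matters here because the main dataset is very large.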

Answer

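You can use cumcount to number the rows within each group of identical timestamps, convert those counts to milliseconds with to_timedelta, and add the result to the index: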

a = df.groupby(level=0).cumcount() 
print (a) 
2016-06-27 13:00:10 0 
2016-06-27 13:00:10 1 
2016-06-27 13:00:10 2 
2016-06-28 13:00:10 0 
2016-06-28 13:00:10 1 
2016-06-28 13:00:10 2 
2016-06-28 13:00:10 3 
2016-06-28 13:00:10 4 
2016-06-28 13:00:10 5 
2016-06-28 13:00:10 6 
2016-06-28 13:00:12 0 
2016-06-28 13:00:12 1 
2016-06-29 13:00:12 0 
2016-06-29 13:00:12 1 
2016-06-29 13:00:12 2 
2016-06-29 13:00:12 3 
2016-06-29 13:00:12 4 
2016-06-29 13:00:12 5 
2016-06-29 13:00:12 6 
2016-06-29 13:00:12 7 
2016-06-29 13:00:12 8 
2016-06-29 13:00:12 9 
2016-06-30 13:00:10 0 
2016-06-30 13:00:11 0 
2016-06-30 13:00:11 1 
dtype: int64 
df.index = df.index + pd.to_timedelta(a, unit='ms') 
print (df) 
          Bid BidVol 
2016-06-27 13:00:10.000 4183.50  0 
2016-06-27 13:00:10.001 4183.50  0 
2016-06-27 13:00:10.002 4183.50  0 
2016-06-28 13:00:10.000 4249.25  1 
2016-06-28 13:00:10.001 4249.25  1 
2016-06-28 13:00:10.002 4249.00  1 
2016-06-28 13:00:10.003 4248.75  1 
2016-06-28 13:00:10.004 4248.75  2 
2016-06-28 13:00:10.005 4248.75  1 
2016-06-28 13:00:10.006 4248.75  2 
2016-06-28 13:00:12.000 4248.50  0 
2016-06-28 13:00:12.001 4248.50  0 
2016-06-29 13:00:12.000 4353.75  0 
2016-06-29 13:00:12.001 4353.75  0 
2016-06-29 13:00:12.002 4353.75  0 
2016-06-29 13:00:12.003 4354.00  1 
2016-06-29 13:00:12.004 4354.00  1 
2016-06-29 13:00:12.005 4353.75  0 
2016-06-29 13:00:12.006 4354.00  1 
2016-06-29 13:00:12.007 4354.00  1 
2016-06-29 13:00:12.008 4354.00  1 
2016-06-29 13:00:12.009 4354.00  1 
2016-06-30 13:00:10.000 4394.00  0 
2016-06-30 13:00:11.000 4394.25  1 
2016-06-30 13:00:11.001 4394.00  0
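
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same cumcount approach. The sample rows, the name `counter`, and the 1000-row guard are my own assumptions for illustration and are not part of the answer above. The guard is there because a counter of 1000 ms or more would push a row into the next second, where it could collide with real data.

```python
import pandas as pd

# Hypothetical sample data: three rows share one second, two share another.
idx = pd.to_datetime([
    "2016-06-28 13:00:10",
    "2016-06-28 13:00:10",
    "2016-06-28 13:00:10",
    "2016-06-28 13:00:12",
    "2016-06-28 13:00:12",
])
df = pd.DataFrame(
    {"Bid": [4249.25, 4249.25, 4249.00, 4248.50, 4248.50],
     "BidVol": [1, 1, 1, 0, 0]},
    index=idx,
)

# 0, 1, 2, ... within each group of identical timestamps
counter = df.groupby(level=0).cumcount()

# Assumption: 1000 or more rows in one second would spill into the next
# second, so fail loudly rather than produce misleading timestamps.
if counter.max() >= 1000:
    raise ValueError("at least one timestamp is shared by more than 1000 rows")

# Pass the raw values so the addition is purely positional.
df.index = df.index + pd.to_timedelta(counter.to_numpy(), unit="ms")

print(df.index.is_unique)
```

Everything here is vectorized, so it should scale much better than looping with iterrows. If more than 1000 rows per second is possible in your data, a smaller unit such as unit="us" might be a safer choice.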