2017-04-07 40 views
0

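Shift values in datetimeindex of pandas dataframe

I have a long (> 1 year) df with a datetime index at 30-minute intervals, so > 17520 rows. For reasons related to daylight saving time, two index values are duplicated and two values are missing. The duplicated values are: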

In[1]: df[df.index.duplicated('first')] 
Out[2]: 
          a   b  c 
timestamp                 
2012-10-07 01:00:00   NaN  NaN  NaN  
2012-10-07 01:30:00   NaN  NaN  NaN  
2013-10-06 01:00:00   NaN  NaN  NaN  
2013-10-06 01:30:00   NaN  NaN  NaN  

I want to change these to the missing values, which are 1 hour later:

In[3]: df[df.index.duplicated('first')].shift(1,freq="H") 
Out[4]: 
          a   b  c 
timestamp                 
2012-10-07 02:00:00   NaN  NaN  NaN  
2012-10-07 02:30:00   NaN  NaN  NaN  
2013-10-06 02:00:00   NaN  NaN  NaN   
2013-10-06 02:30:00   NaN  NaN  NaN 
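As a minimal sketch of what `shift` with a `freq` argument does: it returns a new object whose index labels are moved, while the data and the original frame stay untouched. The tiny frame below is made up for illustration, and `pd.Timedelta(hours=1)` is used in place of the `"H"` string only to avoid depending on a particular alias spelling.

```python
import pandas as pd

# Made-up two-row frame standing in for the duplicated rows.
dup = pd.DataFrame(
    {"a": [1.0, 2.0]},
    index=pd.to_datetime(["2012-10-07 01:00:00", "2012-10-07 01:30:00"]),
)

# With freq given, shift moves the index labels, not the data.
shifted = dup.shift(1, freq=pd.Timedelta(hours=1))

print(shifted.index[0])  # label moved forward by one hour
print(dup.index[0])      # the original frame is unchanged
```

Because the result is a new object, the shifted labels only take effect once they are written back to the frame's index.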

But this does not change the index:

df[df.index.duplicated('first')] = df[df.index.duplicated('first')].shift(1,freq="H") 

What will?

Answers

0

I think you need to rename the duplicated index values by mapping them with a dict:

print (df) 
        a b c 
timestamp      
2013-10-06 01:00:00 1 NaN NaN 
2013-10-06 01:30:00 2 NaN NaN 
2013-10-06 01:00:00 3 NaN NaN 
2013-10-06 01:30:00 4 NaN NaN 
2012-10-08 01:30:00 5 NaN NaN 
2013-10-10 01:00:00 6 NaN NaN 


df1 = df[df.index.duplicated('first')] 
d = dict(zip(df1.index, df1.shift(1,freq="H").index)) 
print (d) 
{Timestamp('2013-10-06 01:00:00'): Timestamp('2013-10-06 02:00:00'), 
Timestamp('2013-10-06 01:30:00'): Timestamp('2013-10-06 02:30:00')} 

df = df.rename(index=d) 
print (df) 
        a b c 
timestamp      
2013-10-06 02:00:00 1 NaN NaN 
2013-10-06 02:30:00 2 NaN NaN 
2013-10-06 02:00:00 3 NaN NaN 
2013-10-06 02:30:00 4 NaN NaN 
2012-10-08 01:30:00 5 NaN NaN 
2013-10-10 01:00:00 6 NaN NaN 

Similar solutions:

idx = df.index[df.index.duplicated('first')] 
d = dict(zip(idx, idx.to_series().shift(freq="H").index)) 
print (d) 
{Timestamp('2013-10-06 01:00:00'): Timestamp('2013-10-06 02:00:00'), 
Timestamp('2013-10-06 01:30:00'): Timestamp('2013-10-06 02:30:00')} 

df = df.rename(index=d) 
print (df) 
        a b c 
timestamp      
2013-10-06 02:00:00 1 NaN NaN 
2013-10-06 02:30:00 2 NaN NaN 
2013-10-06 02:00:00 3 NaN NaN 
2013-10-06 02:30:00 4 NaN NaN 
2012-10-08 01:30:00 5 NaN NaN 
2013-10-10 01:00:00 6 NaN NaN

idx = df.index[df.index.duplicated('first')] 
s = idx.to_series().shift(freq="H") 
#swap index with values in Series 
d = pd.Series(s.index.values, index = s.values).to_dict() 
print (d) 
{Timestamp('2013-10-06 01:00:00'): Timestamp('2013-10-06 02:00:00'), 
Timestamp('2013-10-06 01:30:00'): Timestamp('2013-10-06 02:30:00')} 

df = df.rename(index=d) 
print (df) 
        a b c 
timestamp      
2013-10-06 02:00:00 1 NaN NaN 
2013-10-06 02:30:00 2 NaN NaN 
2013-10-06 02:00:00 3 NaN NaN 
2013-10-06 02:30:00 4 NaN NaN 
2012-10-08 01:30:00 5 NaN NaN 
2013-10-10 01:00:00 6 NaN NaN 
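Note that a label-based `rename` applies to every row carrying a mapped label, so the first occurrence of each duplicated timestamp is renamed along with the second. A minimal sketch of a position-based alternative, assuming only the repeated occurrences should move: build a boolean mask with `duplicated('first')` and replace the labels at those positions only. The sample frame is made up and mirrors the one printed above.

```python
import pandas as pd

idx = pd.DatetimeIndex([
    "2013-10-06 01:00:00", "2013-10-06 01:30:00",
    "2013-10-06 01:00:00", "2013-10-06 01:30:00",
    "2012-10-08 01:30:00", "2013-10-10 01:00:00",
])
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]}, index=idx)

# True only for the second (or later) occurrence of a label.
is_repeat = df.index.duplicated(keep="first")

# Keep first occurrences as they are; use label + 1 hour for the repeats.
df.index = df.index.where(~is_repeat, df.index + pd.Timedelta(hours=1))
print(df)
```

This works by position rather than by label, so rows that share a timestamp can end up with different new labels.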

EDIT1:

You need to add timedeltas, created with cumcount and to_timedelta, to the original index.

delta = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='H') 
print (delta) 
timestamp 
2013-10-06 01:00:00 00:00:00 
2013-10-06 01:30:00 00:00:00 
2013-10-06 01:00:00 01:00:00 
2013-10-06 01:30:00 01:00:00 
2012-10-08 01:30:00 00:00:00 
2013-10-10 01:00:00 00:00:00 
dtype: timedelta64[ns] 

df.index = df.index + delta 
print (df) 
        a b c 
2013-10-06 01:00:00 1 NaN NaN 
2013-10-06 01:30:00 2 NaN NaN 
2013-10-06 02:00:00 3 NaN NaN 
2013-10-06 02:30:00 4 NaN NaN 
2012-10-08 01:30:00 5 NaN NaN 
2013-10-10 01:00:00 6 NaN NaN 
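The `cumcount` approach can be sketched end to end as follows. This is a self-contained variant under a few assumptions: the sample frame is made up, the lowercase unit `"h"` is used because recent pandas releases deprecate the uppercase `"H"` spelling, and the counter is converted with `.to_numpy()` so that the addition is purely positional and the result is still a `DatetimeIndex`.

```python
import pandas as pd

idx = pd.DatetimeIndex([
    "2013-10-06 01:00:00", "2013-10-06 01:30:00",
    "2013-10-06 01:00:00", "2013-10-06 01:30:00",
    "2012-10-08 01:30:00", "2013-10-10 01:00:00",
])
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]}, index=idx)

# 0 for the first occurrence of each label, 1 for the second, and so on.
occurrence = df.groupby(level=0).cumcount()

# Turn the counter into hour offsets, aligned by position with the index.
offset = pd.to_timedelta(occurrence.to_numpy(), unit="h")

# First occurrences get +0 hours, second occurrences get +1 hour.
df.index = df.index + offset
print(df)
```

A label that appeared three times would have its third occurrence moved by two hours, which may or may not be what is wanted for a given data set.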
+0

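No. The first suggestion gives a df1 that contains only the changed timestamps (and not the rest of the year). The second suggestion shifts every timestamp in df, not just the duplicated ones. – doctorer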

+0

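Thanks. Almost there, but not quite. This has renamed both instances of the duplicated values, so I now have duplicates of '2012-10-07 02:00:00' etc. I only want to rename the _second_ instance of each timestamp. – doctorer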

+0

Can you explain why? – jezrael