
Create a unique DatetimeIndex for a DataFrame by adding a timedelta

I have a DataFrame df:

     Col1 
Date 
2015-01-01 00:00:00  1 
2015-01-01 00:00:01  1 
2015-01-01 00:00:01  1 
2015-01-01 00:00:01  1 
2015-01-01 00:00:02  1 
2015-01-01 00:00:04  1 
2015-01-01 00:00:04  1 
2015-01-01 00:00:06  1 
2015-01-01 00:00:07  1 
2015-01-01 00:00:07  1 

It was created with:

import pandas as pd 
from cStringIO import StringIO 

dat = """Date,Col1 
2015-01-01 00:00:00,1 
2015-01-01 00:00:01,1 
2015-01-01 00:00:01,1 
2015-01-01 00:00:01,1 
2015-01-01 00:00:02,1 
2015-01-01 00:00:04,1 
2015-01-01 00:00:04,1 
2015-01-01 00:00:06,1 
2015-01-01 00:00:07,1 
2015-01-01 00:00:07,1""" 

df = pd.read_csv(StringIO(dat)) 
df['Date'] = pd.to_datetime(df['Date']) 
df = df.set_index('Date') 

This DataFrame does not have a unique index:

>>> df.index.is_unique 
False 

I would like to build a unique index by adding 1 millisecond (or less) to the duplicated timestamps, so that I get:

      Col1 
Date 
2015-01-01 00:00:00.000  1 
2015-01-01 00:00:01.000  1 
2015-01-01 00:00:01.001  1 
2015-01-01 00:00:01.002  1 
2015-01-01 00:00:02.000  1 
2015-01-01 00:00:04.000  1 
2015-01-01 00:00:04.001  1 
2015-01-01 00:00:06.000  1 
2015-01-01 00:00:07.000  1 
2015-01-01 00:00:07.001  1 

I am looking for a vectorized solution (no loops), because I have a lot of data to process.
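For clarity, the kind of per-row loop I want to avoid looks roughly like this (a sketch, not my real code):

new_index = [] 
prev, offset = None, 0 
for ts in df.index: 
    if ts == prev: 
        offset += 1           # another duplicate of the same timestamp 
    else: 
        prev, offset = ts, 0  # new timestamp, reset the counter 
    new_index.append(ts + pd.Timedelta(milliseconds=offset)) 
df.index = pd.DatetimeIndex(new_index) 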


I think you may find everything you need in the [official documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#date-handling) –

Answers


You can group the Date column by the cumulative sum of the comparison between Date and its shifted value, number the rows inside each run of consecutive duplicates with cumcount, and add that count converted to nanoseconds.

Nanoseconds (1E-9 s) work better than milliseconds (1E-3 s), because adding milliseconds can itself create new duplicate rows, while nanoseconds cannot, since the original data only has millisecond resolution (e.g. 0 2015-11-02 00:00:01.072 EUR/USD 1.10294 1.10296).
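A small sketch with made-up timestamps (not taken from the real data), showing how a millisecond offset can collide with the next real timestamp while a nanosecond offset cannot:

s = pd.to_datetime(pd.Series(['2015-11-02 00:00:01.072', 
                              '2015-11-02 00:00:01.072', 
                              '2015-11-02 00:00:01.073'])) 
grp = (s != s.shift()).cumsum() 

#milliseconds push the duplicate onto the next real timestamp 
print (s + s.groupby(grp).cumcount().values.astype('timedelta64[ms]')).is_unique 
False 

#nanoseconds stay far below the millisecond resolution of the data 
print (s + s.groupby(grp).cumcount().values.astype('timedelta64[ns]')).is_unique 
True 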

df = df.reset_index() 
#add a nanosecond offset inside each run of duplicate timestamps 
df['Date'] = df['Date'] + (df['Date'].groupby((df['Date'] != df['Date'].shift()).cumsum()) 
             .cumcount()).values.astype('timedelta64[ns]') 
print df 

          Date Col1 
0 2015-01-01 00:00:00.000000000  1 
1 2015-01-01 00:00:01.000000000  1 
2 2015-01-01 00:00:01.000000001  1 
3 2015-01-01 00:00:01.000000002  1 
4 2015-01-01 00:00:02.000000000  1 
5 2015-01-01 00:00:04.000000000  1 
6 2015-01-01 00:00:04.000000001  1 
7 2015-01-01 00:00:06.000000000  1 
8 2015-01-01 00:00:07.000000000  1 
9 2015-01-01 00:00:07.000000001  1 

#set column Date as index 
df = df.set_index('Date') 
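The index should now be unique again:

print df.index.is_unique 
True 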

The fastest solution also uses nanoseconds, but it can only be used if the number of duplicate rows is less than 1000000 (1E6): the running duplicate count, added as nanoseconds, then stays below 1E6 ns = 1 ms, i.e. below the millisecond resolution of the data, so it cannot create new collisions.

So for the csv (3898069 rows), whose length is above 1E6, first check the number of duplicates:

import pandas as pd 

df = pd.read_csv('test/EURUSD-2015-11.csv', header=None, parse_dates=[1], 
        names =['eurusd','Date','a','b'], sep=",") 

#sort values if not sorted 
df = df.sort_values('Date') 
print df.head() 
print df[df['Date'] == df['Date'].shift()] 
      eurusd     Date  a  b 
1996  EUR/USD 2015-11-02 00:51:18.198 1.10323 1.10327 
2944  EUR/USD 2015-11-02 01:00:03.844 1.10321 1.10326 
6450  EUR/USD 2015-11-02 01:37:35.898 1.10319 1.10324 
11429 EUR/USD 2015-11-02 02:24:29.945 1.10301 1.10306 
19468 EUR/USD 2015-11-02 03:13:40.575 1.10326 1.10333 
20074 EUR/USD 2015-11-02 03:17:03.607 1.10282 1.10288 
36618 EUR/USD 2015-11-02 04:36:01.357 1.10213 1.10217 
40235 EUR/USD 2015-11-02 04:49:05.946 1.10075 1.10082 
42930 EUR/USD 2015-11-02 05:01:37.955 1.10034 1.10042 
43269 EUR/USD 2015-11-02 05:03:21.360 1.10070 1.10073 
47043 EUR/USD 2015-11-02 05:22:59.811 1.10142 1.10149 
47526 EUR/USD 2015-11-02 05:25:45.474 1.10143 1.10150 
53398 EUR/USD 2015-11-02 05:58:23.674 1.10294 1.10299 
59899 EUR/USD 2015-11-02 06:44:55.266 1.10145 1.10150 
64480 EUR/USD 2015-11-02 07:30:27.091 1.10211 1.10217 
70576 EUR/USD 2015-11-02 08:14:04.318 1.10329 1.10336 
75662 EUR/USD 2015-11-02 08:54:35.138 1.10485 1.10486 
75724 EUR/USD 2015-11-02 08:55:00.577 1.10504 1.10507 
93917 EUR/USD 2015-11-02 10:55:20.863 1.10345 1.10349 
94603 EUR/USD 2015-11-02 10:57:56.289 1.10352 1.10356 
98046 EUR/USD 2015-11-02 11:16:24.127 1.10272 1.10278 
98433 EUR/USD 2015-11-02 11:19:14.109 1.10281 1.10286 
100582 EUR/USD 2015-11-02 11:31:57.891 1.10247 1.10252 
105627 EUR/USD 2015-11-02 12:11:01.900 1.10243 1.10246 
106789 EUR/USD 2015-11-02 12:19:45.974 1.10183 1.10190 
115219 EUR/USD 2015-11-02 14:06:47.229 1.10194 1.10200 
116808 EUR/USD 2015-11-02 14:35:50.693 1.10204 1.10211 
124436 EUR/USD 2015-11-02 17:06:48.286 1.10125 1.10144 
124532 EUR/USD 2015-11-02 17:07:56.048 1.10160 1.10174 
124734 EUR/USD 2015-11-02 17:11:51.609 1.1.10142 
...   ...      ...  ...  ... 
3893816 EUR/USD 2015-11-30 20:59:38.304 1.05651 1.05655 
3893818 EUR/USD 2015-11-30 20:59:39.341 1.05650 1.05653 
3893819 EUR/USD 2015-11-30 20:59:39.976 1.05651 1.05653 
3893820 EUR/USD 2015-11-30 20:59:45.170 1.05652 1.05653 
3895397 EUR/USD 2015-11-30 20:59:51.605 1.05654 1.05658 
3895398 EUR/USD 2015-11-30 20:59:51.707 1.05655 1.05659 
3893838 EUR/USD 2015-11-30 20:59:51.767 1.05656 1.05657 
3893841 EUR/USD 2015-11-30 20:59:51.816 1.05658 1.05662 
3895401 EUR/USD 2015-11-30 20:59:52.073 1.05659 1.05663 
3895402 EUR/USD 2015-11-30 20:59:52.229 1.05660 1.05664 
3893847 EUR/USD 2015-11-30 20:59:52.818 1.05659 1.05663 
3895404 EUR/USD 2015-11-30 20:59:52.915 1.05660 1.05664 
3893852 EUR/USD 2015-11-30 20:59:53.106 1.05661 1.05662 
3893855 EUR/USD 2015-11-30 20:59:57.031 1.05662 1.05664 
3895407 EUR/USD 2015-11-30 20:59:57.084 1.05664 1.05668 
3895416 EUR/USD 2015-11-30 21:00:00.816 1.05664 1.05665 
3895718 EUR/USD 2015-11-30 21:05:45.605 1.05666 1.05670 
3895857 EUR/USD 2015-11-30 21:12:38.965 1.05659 1.05663 
3895866 EUR/USD 2015-11-30 21:12:44.505 1.05666 1.05666 
3895899 EUR/USD 2015-11-30 21:13:07.805 1.05669 1.05673 
3895931 EUR/USD 2015-11-30 21:13:55.007 1.05675 1.05677 
3896093 EUR/USD 2015-11-30 21:25:27.988 1.05658 1.05663 
3896097 EUR/USD 2015-11-30 21:25:28.002 1.05661 1.05665 
3896209 EUR/USD 2015-11-30 21:28:25.906 1.05655 1.05660 
3896307 EUR/USD 2015-11-30 21:32:32.490 1.05653 1.05658 
3896342 EUR/USD 2015-11-30 21:35:40.005 1.05654 1.05660 
3896393 EUR/USD 2015-11-30 21:40:40.182 1.05648 1.05652 
3896849 EUR/USD 2015-11-30 22:19:34.582 1.05670 1.05684 
3897168 EUR/USD 2015-11-30 22:40:27.108 1.05675 1.05686 
3897389 EUR/USD 2015-11-30 22:50:46.825 1.05705 1.05717 

[35636 rows x 4 columns] 
print len(df[df['Date'] == df['Date'].shift()]) 
35636 

35636 is less than 1000000, so the running duplicate count (at most 999999) can safely be added as nanoseconds:

df.loc[df['Date'] == df['Date'].shift(), 'Date'] = (df['Date'] + 
    ((df['Date'] == df['Date'].shift()).cumsum()).astype('timedelta64[ns]')) 

print df 

          Date Col1 
0 2015-01-01 00:00:00.000000000  1 
1 2015-01-01 00:00:01.000000000  1 
2 2015-01-01 00:00:01.000000001  1 
3 2015-01-01 00:00:01.000000002  1 
4 2015-01-01 00:00:02.000000000  1 
5 2015-01-01 00:00:04.000000000  1 
6 2015-01-01 00:00:04.000000003  1 
7 2015-01-01 00:00:06.000000000  1 
8 2015-01-01 00:00:07.000000000  1 
9 2015-01-01 00:00:07.000000004  1 

...
99945 2015-01-01 23:59:09.000999999  1 
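As in the first solution, Date can then be set back as the index; assuming the duplicate count stayed below 1E6, the index is unique:

df = df.set_index('Date') 
print df.index.is_unique 
True 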

Comparison:

import pandas as pd 

df = pd.read_csv('test/EURUSD-2015-11.csv', header=None, parse_dates=[1], 
       names =['eurusd','Date','a','b'], sep=",") 

#sort values if not sorted 
df = df.sort_values('Date') 
print df.head() 

#print df[df['Date'] == df['Date'].shift()] 
#print len(df[df['Date'] == df['Date'].shift()]) 

df3 = df.copy() 

def ori(df): 
    df['Date']=df['Date']+(df['Date'].groupby((df['Date'] != df['Date'].shift()) 
            .cumsum()).cumcount()).values.astype('timedelta64[ns]') 
    return df 


def new(df): 
    df.loc[df['Date'] == df['Date'].shift(), 'Date'] = (df['Date'] + 
        ((df['Date'] == df['Date'].shift()).cumsum()).astype('timedelta64[ns]')) 

    return df  

df1 = ori(df) 
df2 = new(df3) 


print df1.head() 
print df2.head() 

The timings are much better, because the second approach only needs a boolean comparison and a cumsum instead of a groupby with cumcount:

In [81]: %timeit ori(df) 
1 loops, best of 3: 2min 22s per loop 
Compiler time: 0.10 s 

In [82]: %timeit new(df) 
1 loops, best of 3: 758 ms per loop 

Thanks, but it is very slow (about 1 minute) on https://drive.google.com/file/d/0B8iUtWjZOTqlZnFERGFhNWtCenc/view?usp=sharing (3898069 rows). Any idea how to improve it? –


How does it work? Maybe you can accept the answer. Thank you. – jezrael