2013-05-09 19 views
2

I have candlestick OHLCV data with NaN gaps, like this:

     OPEN HIGH  LOW CLOSE   VOL 
2012-01-01 19:00:00 449000 449000 449000 449000 1336303000 
2012-01-01 20:00:00  NaN  NaN  NaN  NaN   NaN 
2012-01-01 21:00:00  NaN  NaN  NaN  NaN   NaN 
2012-01-01 22:00:00  NaN  NaN  NaN  NaN   NaN 
2012-01-01 23:00:00  NaN  NaN  NaN  NaN   NaN 
... 
         OPEN  HIGH  LOW  CLOSE   VOL 
2013-04-24 14:00:00 11700000 12000000 11600000 12000000 20647095439 
2013-04-24 15:00:00 12000000 12399000 11979000 12399000 23997107870 
2013-04-24 16:00:00 12399000 12400000 11865000 12100000 9379191474 
2013-04-24 17:00:00 12300000 12397995 11850000 11850000 4281521826 
2013-04-24 18:00:00 11850000 11850000 10903000 11800000 15546034128 

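I need to fill the NaN values in this DataFrame according to the following rule.

When OPEN, HIGH, LOW and CLOSE are all NaN:

  • set VOL to 0
  • set OPEN, HIGH, LOW and CLOSE to the previous candle's CLOSE value

Otherwise, keep the NaN.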

Answers

0

Here is how to do this with masking.

Simulate a frame with some holes (A is your 'close' field):

In [20]: df = DataFrame(randn(10,3),index=date_range('20130101',periods=10,freq='min'), 
      columns=list('ABC')) 

In [21]: df.iloc[1:3,:] = np.nan 

In [22]: df.iloc[5:8,1:3] = np.nan 

In [23]: df 
Out[23]: 
          A   B   C 
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362 
2013-01-01 00:01:00  NaN  NaN  NaN 
2013-01-01 00:02:00  NaN  NaN  NaN 
2013-01-01 00:03:00 1.788240 -0.593195 0.059606 
2013-01-01 00:04:00 1.097781 0.835491 -0.855468 
2013-01-01 00:05:00 0.753991  NaN  NaN 
2013-01-01 00:06:00 -0.456790  NaN  NaN 
2013-01-01 00:07:00 -0.479704  NaN  NaN 
2013-01-01 00:08:00 1.332830 1.276571 -0.480007 
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401 

The rows that are entirely NaN:

In [24]: mask_0 = pd.isnull(df).all(axis=1) 

In [25]: mask_0 
Out[25]: 
2013-01-01 00:00:00 False 
2013-01-01 00:01:00  True 
2013-01-01 00:02:00  True 
2013-01-01 00:03:00 False 
2013-01-01 00:04:00 False 
2013-01-01 00:05:00 False 
2013-01-01 00:06:00 False 
2013-01-01 00:07:00 False 
2013-01-01 00:08:00 False 
2013-01-01 00:09:00 False 
Freq: T, dtype: bool 

The rows where we want to propagate A:

In [26]: mask_fill = pd.isnull(df['B']) & pd.isnull(df['C']) 

In [27]: mask_fill 
Out[27]: 
2013-01-01 00:00:00 False 
2013-01-01 00:01:00  True 
2013-01-01 00:02:00  True 
2013-01-01 00:03:00 False 
2013-01-01 00:04:00 False 
2013-01-01 00:05:00  True 
2013-01-01 00:06:00  True 
2013-01-01 00:07:00  True 
2013-01-01 00:08:00 False 
2013-01-01 00:09:00 False 
Freq: T, dtype: bool 

Propagate first:

In [28]: df.loc[mask_fill,'C'] = df['A'] 

In [29]: df.loc[mask_fill,'B'] = df['A'] 

Fill the zeros:

In [30]: df.loc[mask_0] = 0 

Done:

In [31]: df 
Out[31]: 
          A   B   C 
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362 
2013-01-01 00:01:00 0.000000 0.000000 0.000000 
2013-01-01 00:02:00 0.000000 0.000000 0.000000 
2013-01-01 00:03:00 1.788240 -0.593195 0.059606 
2013-01-01 00:04:00 1.097781 0.835491 -0.855468 
2013-01-01 00:05:00 0.753991 0.753991 0.753991 
2013-01-01 00:06:00 -0.456790 -0.456790 -0.456790 
2013-01-01 00:07:00 -0.479704 -0.479704 -0.479704 
2013-01-01 00:08:00 1.332830 1.276571 -0.480007 
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401 
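
The demonstration above sets the all-NaN rows to zero outright. For the rule stated in the question (previous CLOSE into the price columns, 0 into VOL), the same masking idea might be sketched as follows. This is a minimal sketch and not the answerer's code: `fill_empty_candles` and the sample values are hypothetical, and the column names are taken from the question.

```python
import numpy as np
import pandas as pd

PRICE_COLS = ["OPEN", "HIGH", "LOW", "CLOSE"]


def fill_empty_candles(df):
    """Return a copy in which candles with no prices are filled in.

    Hypothetical helper: a candle counts as empty only when OPEN, HIGH,
    LOW and CLOSE are all NaN. Partially missing rows are left untouched.
    """
    out = df.copy()
    # Mask of rows where every price column is missing.
    empty = out[PRICE_COLS].isna().all(axis=1)
    # Last known CLOSE at each row; for an empty row this is the
    # CLOSE of the most recent candle that had one.
    prev_close = out["CLOSE"].ffill()
    for col in PRICE_COLS:
        out.loc[empty, col] = prev_close[empty]
    # No trades in the interval, so the volume is zero.
    out.loc[empty, "VOL"] = 0
    return out


# Made-up sample data for illustration only.
sample = pd.DataFrame(
    {
        "OPEN": [1.0, np.nan, np.nan, 4.0],
        "HIGH": [2.0, np.nan, np.nan, 5.0],
        "LOW": [0.5, np.nan, np.nan, 3.0],
        "CLOSE": [1.5, np.nan, np.nan, 4.5],
        "VOL": [10.0, np.nan, np.nan, 20.0],
    },
    index=pd.to_datetime(
        ["2012-01-01 19:00", "2012-01-01 20:00",
         "2012-01-01 21:00", "2012-01-01 22:00"]
    ),
)
filled = fill_empty_candles(sample)
```

Working on a copy keeps the original frame available for comparison. If the very first row is empty there is no earlier CLOSE to carry forward, so its prices remain NaN.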
+0

This doesn't address the author's comment – MattClimbs 2017-12-28 22:54:41

0

This explains how missing data behaves. The incantation you're looking for is the fillna method, which takes a fill value:

In [1381]: df2 
Out[1381]: 
     one  two  three four five   timestamp 
a  NaN 1.138469 -2.400634 bar True     NaT 
c  NaN 0.025653 -1.386071 bar False     NaT 
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00 
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00 
h  NaN -1.157886 -0.551865 bar False     NaT 

In [1382]: df2.fillna(0) 
Out[1382]: 
     one  two  three four five   timestamp 
a 0.000000 1.138469 -2.400634 bar True 1970-01-01 00:00:00 
c 0.000000 0.025653 -1.386071 bar False 1970-01-01 00:00:00 
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00 
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00 
h 0.000000 -1.157886 -0.551865 bar False 1970-01-01 00:00:00 

You can even propagate values forward or backward:

In [1384]: df 
Out[1384]: 
     one  two  three 
a  NaN 1.138469 -2.400634 
c  NaN 0.025653 -1.386071 
e 0.863937 0.252462 1.500571 
f 1.053202 -2.338595 -0.374279 
h  NaN -1.157886 -0.551865 

In [1385]: df.fillna(method='pad') 
Out[1385]: 
     one  two  three 
a  NaN 1.138469 -2.400634 
c  NaN 0.025653 -1.386071 
e 0.863937 0.252462 1.500571 
f 1.053202 -2.338595 -0.374279 
h 1.053202 -1.157886 -0.551865 
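
The same propagation can also be written with the dedicated `ffill` and `bfill` methods, which newer pandas releases prefer over the `method=` argument. The sketch below uses made-up values loosely modeled on the frame above; it is an illustration, not output from the original session.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps at the start and end of column "one".
df = pd.DataFrame(
    {
        "one": [np.nan, np.nan, 0.86, 1.05, np.nan],
        "two": [1.14, 0.03, 0.25, -2.34, -1.16],
    },
    index=list("acefh"),
)

# Carry the last valid value forward; leading NaNs have nothing to copy.
forward = df.ffill()
# Pull the next valid value backward; trailing NaNs have nothing to copy.
backward = df.bfill()
```

Forward filling leaves rows a and c of column "one" as NaN, just as in the 'pad' output above, because no earlier value exists to propagate.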

For your specific case, I think you'll need to do:

df['VOL'].fillna(0) 
df.fillna(df['CLOSE']) 
+0

For volume it is 'df['VOL'] = df['VOL'].fillna(0)', but 'df = df.fillna(df['CLOSE'])' doesn't work – working4coins 2013-05-09 16:51:20

+0

I did this: 'df['VOL'] = df['VOL'].fillna(0) df['CLOSE'] = df['CLOSE'].fillna() df['OPEN'] = df['OPEN'].fillna(df['CLOSE']) df['LOW'] = df['LOW'].fillna(df['CLOSE']) df['HIGH'] = df['HIGH'].fillna(df['CLOSE'])' self.dataframe – working4coins 2013-05-09 17:46:01

1

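Since neither of the other two answers works, here is a complete answer.

I test two methods here. The first is based on working4coins' comment on hd1's answer, and the second is a slower pure-Python implementation. It seems obvious that the Python implementation should be slower, but I timed both methods to make sure and to quantify the difference. Method 1 does most of the heavy lifting in C (inside the pandas code), so it should be quite fast.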
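
The code for method 1 is not reproduced here. A minimal sketch of what it might look like, reconstructed from working4coins' comment, follows. It assumes the lowercase column names used by method 2 and that a candle is either complete or entirely empty. The body is a guess, not the answer's original code.

```python
import numpy as np
import pandas as pd


def nans_to_prev_close_method1(data_frame):
    """Fill empty candles in place (a reconstruction, not the original)."""
    # No trades in the interval means zero volume.
    data_frame['volume'] = data_frame['volume'].fillna(0.0)
    # Pull the last known close forward into the empty rows.
    data_frame['close'] = data_frame['close'].ffill()
    # Copy the carried-forward close into the other price columns.
    for col in ('open', 'high', 'low'):
        data_frame[col] = data_frame[col].fillna(data_frame['close'])
    return data_frame


# Made-up candles for illustration only.
candles = pd.DataFrame({
    'open': [1.0, np.nan, 3.0],
    'high': [2.0, np.nan, 4.0],
    'low': [0.5, np.nan, 2.5],
    'close': [1.5, np.nan, 3.5],
    'volume': [10.0, np.nan, 30.0],
})
nans_to_prev_close_method1(candles)
```

Each step is a whole-column operation, which is why the work happens inside pandas rather than in a Python loop.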

The slow, pure-Python method (method 2) is shown below:

import numpy as np

def nans_to_prev_close_method2(data_frame):
    prev_close = None
    for index, row in data_frame.iterrows():
        if np.isnan(row['open']):  # alternative test: row.isnull().any()
            # assumes the first row has no nulls!
            # iterrows() yields copies, so write back through .loc
            data_frame.loc[index, ['open', 'high', 'low', 'close']] = prev_close
            data_frame.loc[index, 'volume'] = 0.0
        else:
            prev_close = row['close']

Timing both:

df = trades_to_ohlcv(PATH_TO_RAW_TRADES_CSV, '1s') # splits raw trades into one-second candles 
df2 = df.copy() 

wrapped1 = wrapper(nans_to_prev_close_method1, df) 
wrapped2 = wrapper(nans_to_prev_close_method2, df2) 

print("method 1: %.2f sec" % timeit.timeit(wrapped1, number=1)) 
print("method 2: %.2f sec" % timeit.timeit(wrapped2, number=1)) 
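
The `wrapper` helper is not defined in the answer. Presumably it is the usual closure that binds arguments so `timeit.timeit` can call the function with no parameters. A minimal sketch under that assumption follows, with a hypothetical `slow_sum` workload standing in for the candle-filling functions. `trades_to_ohlcv` and `PATH_TO_RAW_TRADES_CSV` are the answerer's own and are not shown.

```python
import timeit


def wrapper(func, *args, **kwargs):
    """Bind arguments so that timeit can call func with no parameters."""
    def wrapped():
        return func(*args, **kwargs)
    return wrapped


def slow_sum(n):
    # Hypothetical workload used only to demonstrate the helper.
    return sum(range(n))


wrapped = wrapper(slow_sum, 1000)
elapsed = timeit.timeit(wrapped, number=1)
print("slow_sum: %.6f sec" % elapsed)
```

Because both methods modify the frame they receive, timing them on separate copies (df and df2 above) keeps the second run from starting with data that is already filled.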

The results:

method 1: 0.46 sec 
method 2: 151.82 sec 

Clearly method 1 is much faster (about 330 times faster).