2016-06-30 94 views

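How to optimize changing values in a pandas DataFrame column

I'm trying to find out how much a stock changes from a given date to n days in the future. The only problem is that running this on 1,000 rows of data takes about a minute, and I have millions of rows. I think the lag is caused by this line: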

stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])

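I think the full 3D DataFrame of 500 stocks might be getting copied every time that line is hit, but I'm not sure, and I don't know how to make it faster. Here is my code: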

daysForeward = 2 
for days in range(1,daysForeward+1): 
    string = 'closeShift'+str(days) 
    stocks[0][i][string] = stocks[0][i]['adjClose'].shift(days-(days*2)) 

for line in stocks[0][i].itertuples():
    num = 6  # first closeShift column
    for days in range(1,daysForeward+1):
        string = 'closeShift'+str(days)
        stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])
        num+=1
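
As a side note on the shift step above: days-(days*2) evaluates to -days, and a negative shift moves later rows up so that each row sees a future value. A minimal sketch with a made-up series (the values and the name adj_close are illustrative assumptions, not the asker's data):

```python
import pandas as pd

# Toy series standing in for one stock's adjClose column (values are made up).
adj_close = pd.Series([10.0, 11.0, 12.0, 13.0])

days = 2
# days - (days * 2) simplifies to -days, so both calls shift the same way.
shifted_original = adj_close.shift(days - (days * 2))
shifted_simple = adj_close.shift(-days)

# Row i now holds the value from row i + days; the last `days` rows become NaN.
print(shifted_simple.tolist())  # [12.0, 13.0, nan, nan]
```

Writing shift(-days) directly would likely be easier to read, with the same result.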

Here is the data, before and after the percent change is applied:

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869735     0.959720
3  19980107  20.1408  20.9950  1.034904    -3.947904    -5.022423
4  19980108  21.1115  20.0244  0.994047    -1.118683    -0.463311

It also throws this warning:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
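
This warning is commonly triggered by chained indexing, where a column is selected first and a row is then assigned on the intermediate result. A minimal sketch of the single-step alternative with .loc, using a small made-up frame (the data here is an assumption for illustration only):

```python
import pandas as pd

# Small illustrative frame; the numbers are made up.
df = pd.DataFrame({"adjClose": [1.0, 2.0, 3.0],
                   "closeShift1": [2.0, 3.0, None]})

# Chained indexing such as df["closeShift1"][0] = ... first selects a column,
# then assigns into that intermediate object, which may be a copy.
# A single .loc call addresses the row and the column in one step instead.
df.loc[0, "closeShift1"] = 99.0

print(df.loc[0, "closeShift1"])  # 99.0
```

Switching to .loc addresses the warning, but a per-cell assignment inside a Python loop will probably still be slow on millions of rows.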

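A few notes on my code:

The [0] in stocks[0][i] is only there to reach the appropriate level of the 3D DataFrame, and [i] is the stock name; the stocks are iterated over in a higher-level loop.

The adjClose column is just a modified version of close, which I prefer to use instead of close.

adjPctChange() is a custom percent-change function that switches the equation so that going from 100 to 50 gives a result of the same magnitude as going from 50 to 100, so the results can be averaged without skewing upward.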

def adjPctChange(startPoint, currentPoint):
    if startPoint < currentPoint:
        x = abs(((float(startPoint)-currentPoint)/float(currentPoint))*100.0)
    else:
        x = ((float(currentPoint)-startPoint)/float(startPoint))*100.0
    return x
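
A quick check of the symmetry described above, using the function as posted. The plain percent change shown for contrast is an added illustration, not part of the original code:

```python
def adjPctChange(startPoint, currentPoint):
    # Same logic as the function in the question: always divide by the larger value.
    if startPoint < currentPoint:
        x = abs(((float(startPoint) - currentPoint) / float(currentPoint)) * 100.0)
    else:
        x = ((float(currentPoint) - startPoint) / float(startPoint)) * 100.0
    return x

up = adjPctChange(50, 100)    # rise from 50 to 100
down = adjPctChange(100, 50)  # fall from 100 to 50
print(up, down)  # 50.0 -50.0

# For contrast, an ordinary percent change divides by the start value,
# so the same move reads as +100% going up but only -50% going down.
plain_up = (100 - 50) / 50 * 100.0
plain_down = (50 - 100) / 100 * 100.0
print(plain_up, plain_down)  # 100.0 -50.0
```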

Thanks to anyone who can help!

Answers


You should not iterate over a DataFrame row by row; just do everything with array functions.

Before:

In [30]: df
Out[30]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

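Array notation:

import numpy as np

daysForeward = 2
for day in range(1, daysForeward+1):
    column = 'closeShift' + str(day)
    # (future - current) / larger of the two, computed for the whole column at once
    df[column] = (df[column] - df.adjClose)/np.maximum(df[column], df.adjClose) * 100.0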

After:

In [33]: df
Out[33]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869727     0.959713
3  19980107  20.1408  20.9950  1.034904    -3.947902    -5.022495
4  19980108  21.1115  20.0244  0.994047    -1.118760    -0.463358
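
To see that the array expression agrees with the row-by-row function from the question, here is a self-contained sketch. It assumes positive prices, rebuilds only the adjClose column from the sample rows, and uses the names days_forward, future and expected purely for illustration:

```python
import numpy as np
import pandas as pd

def adjPctChange(startPoint, currentPoint):
    # Same logic as the function in the question.
    if startPoint < currentPoint:
        return abs((float(startPoint) - currentPoint) / float(currentPoint) * 100.0)
    return (float(currentPoint) - startPoint) / float(startPoint) * 100.0

# adjClose values copied from the sample rows in the question.
df = pd.DataFrame({"adjClose": [np.nan, np.nan, 0.984507, 1.034904, 0.994047]})

days_forward = 2
for day in range(1, days_forward + 1):
    # Future value for every row at once, then the symmetric percent change.
    future = df["adjClose"].shift(-day)
    df["closeShift" + str(day)] = (
        (future - df["adjClose"]) / np.maximum(future, df["adjClose"]) * 100.0
    )

# Row-by-row reference for the first shifted column, for comparison only.
expected = [adjPctChange(a, b)
            for a, b in zip(df["adjClose"], df["adjClose"].shift(-1))]
matches = np.allclose(df["closeShift1"].to_numpy(),
                      np.array(expected, dtype=float), equal_nan=True)
print(matches)  # True
```

The shift and the percent change are each one column-wide operation per day, so no per-row Python loop or per-cell assignment is needed.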

IIUC:

I started with this DataFrame:

print(df)

       date     open    close  adjclose
0  19980102  20.3835  20.4417  0.984507
1  19980105  20.5097  20.5679  1.034904
2  19980106  20.1408  20.0826  0.994047
3  19980107  20.1408  20.9950  0.982926
4  19980108  21.1115  20.0244  0.989441

Then I created these functions:

import pandas as pd

def get_lags(s, n):
    # Columns 0..n hold the series shifted by 0..n rows
    return pd.concat([s.shift(i) for i in range(n + 1)],
                     axis=1, keys=range(n + 1))

def get_comps(lags):
    comps = []
    for i, cni in enumerate(lags.columns):
        if i > 0:
            # Compare the unshifted column with the i-th lag
            max_ = lags.iloc[:, [0, i]].max(axis=1)
            min_ = lags.iloc[:, [0, i]].min(axis=1)
            comps.append((max_/min_ - 1) * 100)
    return pd.concat(comps, axis=1)

Then I get the lags and compare them:

print(get_comps(get_lags(df.adjclose, 2)))

          0         1
0  0.000000  0.000000
1  5.119009  0.000000
2  4.110168  0.969013
3  1.131418  5.288089
4  0.662817  0.465515

Finally, I concatenate them with df:

print(pd.concat([df, get_comps(get_lags(df.adjclose, 2))], axis=1))

       date     open    close  adjclose         0         1
0  19980102  20.3835  20.4417  0.984507  0.000000  0.000000
1  19980105  20.5097  20.5679  1.034904  5.119009  0.000000
2  19980106  20.1408  20.0826  0.994047  4.110168  0.969013
3  19980107  20.1408  20.9950  0.982926  1.131418  5.288089
4  19980108  21.1115  20.0244  0.989441  0.662817  0.465515

Modify as needed.
