
I am trying to resample/groupby from a tick list into non-time-series OHLC bars (range bars, volume bars, etc.) using a Pandas groupby on a cumulative sum.

First question (grouping by cumulative volume):

Raw data DF:

   symbol         utime                 time   price  vol   cc   cv ttype
id
1  DOLX16  1.476961e+09  2016-10-20 09:00:37  3179.0    5  120  120     R
2  DOLX16  1.476961e+09  2016-10-20 09:00:37  3179.0   10  735  120     R
3  DOLX16  1.476961e+09  2016-10-20 09:00:37  3179.0   20  735  120     R
4  DOLX16  1.476961e+09  2016-10-20 09:00:37  3179.0   30  735    3     R
5  DOLX16  1.476961e+09  2016-10-20 09:00:37  3179.0    5  735  147     R

I need to group rows until the running sum of the vol column reaches a constant, e.g. 500; once 500 is reached, the summation should restart from zero for the next group...

Pseudo-code:

vol_amount = 500

adict = {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last'}

ohlc_vol = data.groupby(data['vol'].cumsum() <= vol_amount)['price'].agg(adict)

ohlc_vol['ticks_count'] = data.groupby(data['vol'].cumsum() <= vol_amount)['vol'].count()

Thanks for your help!

Answers


How about the ol' 'cumsum' trick:

import pandas as pd

def grouper(df, threshold=500):
    # Walk the rows in order, keeping a running volume total.
    # Whenever the total has reached the threshold, flag the current
    # row as the start of a new bin and reset the running total.
    df['new_bin'] = 0
    cum_vol = 0
    for i in df.index:
        if cum_vol >= threshold:
            df.loc[i, 'new_bin'] = 1
            cum_vol = 0
        cum_vol += df.loc[i, "vol"]
    # Cumulative sum of the start flags yields a group id per row.
    df['group'] = df['new_bin'].cumsum()
    return df.drop("new_bin", axis=1)


df = pd.DataFrame({"price": [1, 2, 3, 2, 4, 5, 6, 7, 6],
                   "vol": [100, 300, 101, 100, 402, 103, 300, 100, 30]})

df = grouper(df)
print(df)

   price  vol  group
0      1  100      0
1      2  300      0
2      3  101      0
3      2  100      1
4      4  402      1
5      5  103      2
6      6  300      2
7      7  100      2
8      6   30      3


adict = {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last'}
print(df.groupby("group")['price'].agg(adict))

       high  close  open  low
group
0         3      3     1    1
1         4      4     2    2
2         7      7     5    5
3         6      6     6    6

print(df.groupby("group")['vol'].sum())

group
0    501
1    502
2    503
3     30
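
For completeness, here is a minimal sketch (not part of the original answer) that, assuming the df returned by grouper above with its group column, collects the OHLC prices, total volume and tick count per bin in one place:

grouped = df.groupby("group")
ohlc = grouped['price'].agg(['first', 'max', 'min', 'last'])  # one row per volume bin
ohlc.columns = ['open', 'high', 'low', 'close']
ohlc['volume'] = grouped['vol'].sum()          # total volume per bin
ohlc['ticks_count'] = grouped['vol'].count()   # number of ticks per bin
print(ohlc)

Note that each bin's total volume slightly overshoots the threshold (501, 502, 503 above): the row that pushes the running total past the threshold stays in the current bin, and a new bin only starts on the following row.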

Consider splitting the cumulative sum at multiples of vol_amount with the integer (floor) division operator, //. Then use that grouping to aggregate price:

from io import StringIO 
import pandas as pd 

text = ''' 
id symbol   utime    time price vol cc cv ttype 
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 120 120  R 
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 10 735 120  R 
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 20 735 120  R 
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 30 735 3  R 
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 735 147  R 
''' 

data = pd.concat([pd.read_table(StringIO(text), sep="\s+"), 
        pd.read_table(StringIO(text), sep="\s+"), 
        pd.read_table(StringIO(text), sep="\s+"), 
        pd.read_table(StringIO(text), sep="\s+"), 
        pd.read_table(StringIO(text), sep="\s+")]) 

# RANDOMIZE PRICE FOR DEMO 
from random import randint, seed 
seed(a=48) 
data['price'] = [float(randint(3175,3199)) for i in range(25)] 

# VOLUME CUMULATIVE GROUP 
vol_amount = 100 
data['volcum'] = data['vol'].cumsum() 
data['volcumgrp'] = data['volcum'] - ((data['volcum'] // vol_amount) * vol_amount) 
print(data) 

# PRICE AGGREGATION 
adict = {'open': 'first', 'high':'max', 'low':'min', 'close' : 'last'} 
ohlc_vol = data.groupby(['volcumgrp'])['price'].agg(adict) 
ohlc_vol['ticks_count'] = data.groupby(['volcumgrp'])['vol'].count() 
print(ohlc_vol) 
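
As a side note (not part of the original answer), the floor-division expression above simply computes the remainder of the cumulative volume, so the volcumgrp column could equally be written with the modulo operator:

# equivalent to data['volcum'] - ((data['volcum'] // vol_amount) * vol_amount)
data['volcumgrp'] = data['volcum'] % vol_amount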

Output

data DF


To demonstrate, the posted dataframe is stacked repeatedly to get enough data (with groups binned every 100):

 id  symbol  utime  time price vol cc cv ttype volcum volcumgrp 
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 5 120 120  R  5   5 
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3185.0 10 735 120  R  15   15 
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 20 735 120  R  35   35 
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 30 735 3  R  65   65 
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3197.0 5 735 147  R  70   70 
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 5 120 120  R  75   75 
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3184.0 10 735 120  R  85   85 
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3191.0 20 735 120  R  105   5 
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3181.0 30 735 3  R  135   35 
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3197.0 5 735 147  R  140   40 
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3199.0 5 120 120  R  145   45 
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3188.0 10 735 120  R  155   55 
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3180.0 20 735 120  R  175   75 
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 30 735 3  R  205   5 
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3196.0 5 735 147  R  210   10 
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3178.0 5 120 120  R  215   15 
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3190.0 10 735 120  R  225   25 
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3195.0 20 735 120  R  245   45 
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3182.0 30 735 3  R  275   75 
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3181.0 5 735 147  R  280   80 
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3199.0 5 120 120  R  285   85 
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3191.0 10 735 120  R  295   95 
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 20 735 120  R  315   15 
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3191.0 30 735 3  R  345   45 
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 735 147  R  350   50 

ohlc_vol DF

   open  low high close ticks_count 
volcumgrp            
5   3192.0 3179.0 3192.0 3179.0   3 
10   3196.0 3196.0 3196.0 3196.0   1 
15   3185.0 3178.0 3192.0 3192.0   3 
25   3190.0 3190.0 3190.0 3190.0   1 
35   3179.0 3179.0 3181.0 3181.0   2 
40   3197.0 3197.0 3197.0 3197.0   1 
45   3199.0 3191.0 3199.0 3191.0   3 
50   3179.0 3179.0 3179.0 3179.0   1 
55   3188.0 3188.0 3188.0 3188.0   1 
65   3192.0 3192.0 3192.0 3192.0   1 
70   3197.0 3197.0 3197.0 3197.0   1 
75   3192.0 3180.0 3192.0 3182.0   3 
80   3181.0 3181.0 3181.0 3181.0   1 
85   3184.0 3184.0 3199.0 3199.0   2 
95   3191.0 3191.0 3191.0 3191.0   1 

I don't think this is what the OP wants @Parfait – dataflow


Hopefully the OP and future readers won't be confused by the demo, where 100 is used in place of 500 and the data from the OP's post is duplicated to get enough rows. Both *vol_amount* and the actual *data* can easily be changed. – Parfait