2016-05-16 118 views

Python groupwise winsorization and linear regression

I am new to the Python world and have to work with a financial dataset. Say I have a dataframe like this:

TradingDate StockCode  Size  ILLIQ 
0 20050131 000001 13.980320 77.7522 
1 20050131 000002 14.071253 19.1471 
2 20050131 000004 10.805564 696.2428 
3 20050131 000005 11.910485 621.3723 
4 20050131 000006 11.631550 339.0952 
*** *** 

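What I want to do is run a groupwise OLS regression, where the grouping variable is TradingDate, the dependent variable is 'Size', and the independent variable is 'ILLIQ'. I want to append the regression residuals back to the original dataframe as a new column, say one named 'residual'. How can I do this?

The following code does not seem to work: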

def regress(data,yvar,xvars): 
    Y = data[yvar] 
    X = data[xvars] 
    X['intercept']=1. 
    result = sm.OLS(Y,X).fit() 
    return result.resid() 

by_Date = df.groupby('TradingDate') 
by_Date.apply(regress,'ILLIQ',['Size']) 

Answers

You just need to use .resid to access the residuals; .resid is an attribute, not a method (see the docs). A simplified illustration:

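import pandas as pd
import statsmodels.formula.api as sm
# set_index returns a new frame; assigning the result of an inplace=True call would leave df as None
df = df.set_index('TradingDate')
# .values assigns by position, so this assumes df is sorted by TradingDate
df['residuals'] = df.groupby(level=0).apply(lambda x: pd.DataFrame(sm.ols(formula="Size ~ ILLIQ", data=x).fit().resid)).values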

      StockCode  Size  ILLIQ residuals 
TradingDate           
20050131    1 13.980320 77.7522 0.299278 
20050131    2 14.071253 19.1471 0.132318 
20050131    4 10.805564 696.2428 -0.153800 
20050131    5 11.910485 621.3723 0.621652 
20050131    6 11.631550 339.0952 -0.899448 
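
The `.values` assignment above depends on the group order matching the row order. A minimal sketch of an index-aligned alternative follows, assuming only numpy and pandas: it swaps `np.linalg.lstsq` in for statsmodels so it runs without that dependency. The helper name `ols_resid` is hypothetical, and the 20050228 rows are invented so that there is more than one group.

```python
import numpy as np
import pandas as pd

# The first five rows come from the question; the 20050228 rows are
# made up for illustration so that groupby has two groups to work on.
df = pd.DataFrame({
    "TradingDate": [20050131] * 5 + [20050228] * 4,
    "StockCode": ["000001", "000002", "000004", "000005", "000006",
                  "000001", "000002", "000004", "000005"],
    "Size": [13.980320, 14.071253, 10.805564, 11.910485, 11.631550,
             14.1, 14.0, 10.9, 12.0],
    "ILLIQ": [77.7522, 19.1471, 696.2428, 621.3723, 339.0952,
              80.0, 25.0, 650.0, 600.0],
})

def ols_resid(group, yvar, xvars):
    """Residuals of yvar regressed on xvars plus an intercept (hypothetical helper)."""
    X = np.column_stack([np.ones(len(group)),
                         group[xvars].to_numpy(dtype=float)])
    y = group[yvar].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Keep the group's own row labels so pandas can align the result.
    return pd.Series(y - X @ beta, index=group.index)

# Fit one regression per trading date, then let index alignment put each
# residual back on its own row, whatever order the rows are in.
pieces = [ols_resid(g, "Size", ["ILLIQ"]) for _, g in df.groupby("TradingDate")]
df["residuals"] = pd.concat(pieces)
print(df)
```

For the 20050131 rows this gives the same residuals as the table above. Because the assignment aligns on the row index, it needs a unique index (the default RangeIndex here), so call `reset_index()` first if TradingDate is already the index.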

I tried your code and it gave the following error: ValueError: Length of values does not match length of index – Vincent

I think I had moved 'TradingDate' to the index first; let me update the answer. – Stefan

Actually, I set the index to the TradingDate column when importing the data from the SQL DB: df = pd.read_sql_query(query,con,index_col = ['TradingDate']) – Vincent

Setup

from io import StringIO  # Python 3 location of StringIO
import pandas as pd 

text = """TradingDate StockCode  Size  ILLIQ 
0 20050131 000001 13.980320 77.7522 
1 20050131 000002 14.071253 19.1471 
2 20050131 000004 10.805564 696.2428 
3 20050131 000005 11.910485 621.3723 
4 20050131 000006 11.631550 339.0952""" 

df = pd.read_csv(StringIO(text), sep=r'\s+',
       converters=dict(TradingDate=pd.to_datetime)) 

Solution

import statsmodels.api as sm

def regress(data,yvar,xvars):
    # I changed this a bit to ensure proper dimensional alignment
    Y = data[[yvar]].copy()
    X = data[xvars].copy()
    X['intercept'] = 1
    result = sm.OLS(Y,X).fit()
    # resid is an attribute not a method
    return result.resid

def append_resids(df, yvar, xvars):
    """New helper to return DataFrame object within groupby apply"""
    df = df.copy()
    df['residuals'] = regress(df, yvar, xvars)
    return df

# The arguments are (data, yvar, xvars) and are kept as in the question's code,
# which regresses ILLIQ on Size; for the regression described in the question
# (Size on ILLIQ), pass 'Size', ['ILLIQ'] instead.
df.groupby('TradingDate').apply(lambda x: append_resids(x, 'ILLIQ', ['Size']))
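
The failure behind the `# resid is an attribute not a method` comment can be reproduced without statsmodels. This is a minimal sketch, assuming `resid` holds a pandas Series (which is what a model fit on pandas inputs is expected to return); the values are simply the residuals shown in the first answer.

```python
import pandas as pd

# Stand-in for result.resid: a plain Series of residuals.
resid = pd.Series([0.299278, 0.132318, -0.153800, 0.621652, -0.899448])

message = None
try:
    resid()  # what writing result.resid() amounts to: calling the Series
except TypeError as exc:
    message = str(exc)

print(message)       # the Series is data, not a function
print(resid.sum())   # accessed as an attribute, it behaves like any Series
```

Dropping the parentheses is the whole fix for that part of the original function.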