2017-01-06 133 views
4

Python pandas linear regression with groupby

I am trying to run a linear regression on each group of a pandas DataFrame in Python.

This is the DataFrame df:

group  date  value 
    A  01-02-2016  16 
    A  01-03-2016  15 
    A  01-04-2016  14 
    A  01-05-2016  17 
    A  01-06-2016  19 
    A  01-07-2016  20 
    B  01-02-2016  16 
    B  01-03-2016  13 
    B  01-04-2016  13 
    C  01-02-2016  16 
    C  01-03-2016  16 

#import standard packages 
import pandas as pd 
import numpy as np 

#import ML packages 
from sklearn.linear_model import LinearRegression 

#First, let's group the data by group 
df_group = df.groupby('group') 

#Then, we need to change the date to integer 
df['date'] = pd.to_datetime(df['date']) 
df['date_delta'] = (df['date'] - df['date'].min())/np.timedelta64(1,'D') 

Now I want to predict the value for each group on 01-10-2016.

I would like to end up with a new DataFrame like this:

group  01-10-2016 
    A  predicted value 
    B  predicted value 
    C  predicted value 

The approach from "How to apply OLS from statsmodels to groupby" does not work for me:

for group in df_group.groups.keys(): 
     df= df_group.get_group(group) 
     X = df['date_delta'] 
     y = df['value'] 
     model = LinearRegression(y, X) 
     results = model.fit(X, y) 
     print results.summary() 

I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [ 1 52] 

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning) 

UPDATE:

I changed it to:

for group in df_group.groups.keys(): 
     df= df_group.get_group(group) 
     X = df[['date_delta']] 
     y = df.value 
     model = LinearRegression(y, X) 
     results = model.fit(X, y) 
     print results.summary() 

and now I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). 
+0

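Please explain what you mean by "it doesn't work". Does it raise an error? If so, please include the traceback. If not, what output did you expect, and what output did you get? – ayhan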

+1

@ayhan - Done! Thank you – jeangelj

+1

You are clobbering your `df` inside the loop. – piRSquared

Answers

5

New answer

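def model(df, delta): 
    # Fit value ~ date_delta for one group, then predict at the day offset `delta`
    y = df[['value']].values 
    X = df[['date_delta']].values 
    # predict() expects a 2-D array of shape (n_samples, n_features)
    return np.squeeze(LinearRegression().fit(X, y).predict(np.array([[delta]]))) 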

def group_predictions(df, date): 
    date = pd.to_datetime(date) 
    df.date = pd.to_datetime(df.date) 

    day = np.timedelta64(1, 'D') 
    mn = df.date.min() 
    df['date_delta'] = df.date.sub(mn).div(day) 

    dd = (date - mn)/day 

    return df.groupby('group').apply(model, delta=dd) 
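
A note on the date handling in `group_predictions`: `pd.to_datetime` reads a string such as '01-10-2016' month first, and an ISO string such as '2016-01-10' is unambiguous, so the same code should work with either spelling. The sketch below passes an explicit `format` to make that visible. The variable names are only for illustration and are not part of the functions above.

```python
import numpy as np
import pandas as pd

# Both spellings denote 10 January 2016; an explicit format removes any ambiguity.
us_style = pd.to_datetime("01-10-2016", format="%m-%d-%Y")
iso_style = pd.to_datetime("2016-01-10", format="%Y-%m-%d")

# Day offset from the earliest date in the sample data (2 January 2016),
# computed the same way as `dd` in group_predictions.
start = pd.to_datetime("01-02-2016", format="%m-%d-%Y")
delta_days = (iso_style - start) / np.timedelta64(1, "D")

print(us_style == iso_style, delta_days)
```

Once the strings are parsed into timestamps, the regression only ever sees the numeric `date_delta`, so the input format does not affect the fit.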

Demo

group_predictions(df, '01-10-2016') 

group 
A 22.333333333333332 
B  3.500000000000007 
C     16.0 
dtype: object 
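
If you want the result as a DataFrame with a `group` column and one column named after the target date, as in the question, something like the following may be a starting point. It is a self-contained sketch that rebuilds the sample data and, to keep dependencies small, uses `np.polyfit` as a stand-in for `LinearRegression` (for a single feature both fit an ordinary least-squares line). The names `target`, `rows` and `predictions` are my own.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
start = df["date"].min()
df["date_delta"] = (df["date"] - start) / np.timedelta64(1, "D")

target = "01-10-2016"
target_delta = (pd.to_datetime(target, format="%m-%d-%Y") - start) / np.timedelta64(1, "D")

rows = []
for name, g in df.groupby("group"):
    # Degree-1 polynomial fit: returns [slope, intercept].
    slope, intercept = np.polyfit(g["date_delta"], g["value"], 1)
    rows.append({"group": name, target: float(slope * target_delta + intercept)})

predictions = pd.DataFrame(rows)
print(predictions)
```

This gives one row per group, and the numbers should agree with the demo output above up to floating-point rounding.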

Old answer

You are using LinearRegression incorrectly.

  • You don't pass the data to the constructor. Just instantiate the class like this:
    • model = LinearRegression()
  • Then fit:
    • model.fit(X, y)
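
Put together, the loop from the question could look roughly like this. It is a sketch on a reduced copy of the sample data (groups A and B only, with `date_delta` already filled in). The loop variable is called `g` instead of `df` so that the original frame is not overwritten, and `X` is selected with double brackets so that it is 2-D. The `coefs` dictionary is just my way of collecting the results.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Reduced sample data: groups A and B with date_delta precomputed (days since 01-02-2016).
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 3,
    "date_delta": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 1.0, 2.0],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13],
})

coefs = {}
for name, g in df.groupby("group"):
    X = g[["date_delta"]]        # double brackets keep X two-dimensional
    y = g["value"]
    model = LinearRegression()   # no data here; constructor arguments are options
    model.fit(X, y)              # the data goes to fit()
    coefs[name] = (float(model.intercept_), float(model.coef_[0]))

print(coefs)
```

With `LinearRegression(y, X)` the two arrays end up being treated as constructor options rather than data, which is most likely where the "truth value of a Series is ambiguous" error comes from.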

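However, all this does is set values on the object stored in model; there is no nice summary method. There is probably one somewhere, but I know the one in statsmodels, so see below.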


Option 1
Use statsmodels instead

from statsmodels.formula.api import ols 

for k, g in df_group: 
    model = ols('value ~ date_delta', g) 
    results = model.fit() 
    print(results.summary()) 

     OLS Regression Results        
============================================================================== 
Dep. Variable:     value R-squared:      0.652 
Model:       OLS Adj. R-squared:     0.565 
Method:     Least Squares F-statistic:      7.500 
Date:    Fri, 06 Jan 2017 Prob (F-statistic):    0.0520 
Time:      10:48:17 Log-Likelihood:    -9.8391 
No. Observations:     6 AIC:        23.68 
Df Residuals:      4 BIC:        23.26 
Df Model:       1           
Covariance Type:   nonrobust           
============================================================================== 
       coef std err   t  P>|t|  [95.0% Conf. Int.] 
------------------------------------------------------------------------------ 
Intercept  14.3333  1.106  12.965  0.000  11.264 17.403 
date_delta  1.0000  0.365  2.739  0.052  -0.014  2.014 
============================================================================== 
Omnibus:       nan Durbin-Watson:     1.393 
Prob(Omnibus):     nan Jarque-Bera (JB):    0.461 
Skew:       -0.649 Prob(JB):      0.794 
Kurtosis:      2.602 Cond. No.       5.78 
============================================================================== 

Warnings: 
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. 
          OLS Regression Results        
============================================================================== 
Dep. Variable:     value R-squared:      0.750 
Model:       OLS Adj. R-squared:     0.500 
Method:     Least Squares F-statistic:      3.000 
Date:    Fri, 06 Jan 2017 Prob (F-statistic):    0.333 
Time:      10:48:17 Log-Likelihood:    -3.2171 
No. Observations:     3 AIC:        10.43 
Df Residuals:      1 BIC:        8.631 
Df Model:       1           
Covariance Type:   nonrobust           
============================================================================== 
       coef std err   t  P>|t|  [95.0% Conf. Int.] 
------------------------------------------------------------------------------ 
Intercept  15.5000  1.118  13.864  0.046   1.294 29.706 
date_delta -1.5000  0.866  -1.732  0.333  -12.504  9.504 
============================================================================== 
Omnibus:       nan Durbin-Watson:     3.000 
Prob(Omnibus):     nan Jarque-Bera (JB):    0.531 
Skew:       -0.707 Prob(JB):      0.767 
Kurtosis:      1.500 Cond. No.       2.92 
============================================================================== 

Warnings: 
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. 
          OLS Regression Results        
============================================================================== 
Dep. Variable:     value R-squared:      -inf 
Model:       OLS Adj. R-squared:     -inf 
Method:     Least Squares F-statistic:     -0.000 
Date:    Fri, 06 Jan 2017 Prob (F-statistic):    nan 
Time:      10:48:17 Log-Likelihood:     63.481 
No. Observations:     2 AIC:       -123.0 
Df Residuals:      0 BIC:       -125.6 
Df Model:       1           
Covariance Type:   nonrobust           
============================================================================== 
       coef std err   t  P>|t|  [95.0% Conf. Int.] 
------------------------------------------------------------------------------ 
Intercept  16.0000  inf   0  nan   nan  nan 
date_delta -3.553e-15  inf   -0  nan   nan  nan 
============================================================================== 
Omnibus:       nan Durbin-Watson:     0.400 
Prob(Omnibus):     nan Jarque-Bera (JB):    0.333 
Skew:       0.000 Prob(JB):      0.846 
Kurtosis:      1.000 Cond. No.       2.62 
============================================================================== 
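
The fitted statsmodels results can also produce predictions directly, so the coefficients do not have to be read off the summaries by hand. A minimal sketch, again on a reduced copy of the data (groups A and B) and assuming the target date 01-10-2016 corresponds to `date_delta` 8. The names `new_point` and `predicted` are illustrative.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Reduced sample data: groups A and B with date_delta precomputed (days since 01-02-2016).
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 3,
    "date_delta": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 1.0, 2.0],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13],
})

# The point to predict at: 8 days after the first date, i.e. 01-10-2016.
new_point = pd.DataFrame({"date_delta": [8.0]})

predicted = {}
for name, g in df.groupby("group"):
    results = ols("value ~ date_delta", data=g).fit()
    # With a formula model, predict() takes a DataFrame with the same column names.
    predicted[name] = float(results.predict(new_point).iloc[0])

print(predicted)
```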
+0

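Thank you @piRSquared; is there a way to do this with LinearRegression? I am trying to create a DataFrame containing the predicted values for a future date. With the OLS summary approach, I would have to look up each group's formula manually and compute the value for 01-10-2016 myself. – jeangelj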

+0

@jeangelj I have updated my post – piRSquared

+0

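Thank you very much; I have one more question, so that I fully understand how you solved this: if the dates in my dataset were in the format "2016-01-10", how would that change the code? – jeangelj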