Spark MLlib linear regression gives very bad results

I keep getting really poor results when trying to do a linear regression in Python with Spark MLlib's LinearRegressionWithSGD.

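I looked at similar questions, such as the following:

I am well aware that the key is to tune the parameters just right.

I also understand that Stochastic Gradient Descent will not necessarily find an optimal solution (like Alternating Least Squares), because there is a chance of getting stuck in local minima. But at the very least I would expect to find an OK model.

Here is my setup. I chose to use this example from the Journal of Statistics Education and the corresponding dataset. I know from that paper (and from replicating the results in JMP) that if I use only the numeric fields, I should get something similar to the following equation (with an R^2 of about 44% and an RMSE of about 7400):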

Price = 7323 - 0.171 Mileage + 3200 Cylinder - 1463 Doors + 6206 Cruise - 2024 Sound + 3327 Leather
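To make the target concrete, here is a minimal sketch that evaluates the equation above for one car. The coefficients are copied from the equation; the helper name `predict_price` and the example car are hypothetical and are not taken from the dataset.

```python
# Coefficients copied from the equation quoted above.
INTERCEPT = 7323.0
COEFFS = {
    "Mileage": -0.171,
    "Cylinder": 3200.0,
    "Doors": -1463.0,
    "Cruise": 6206.0,
    "Sound": -2024.0,
    "Leather": 3327.0,
}

def predict_price(car):
    """Apply the linear equation to a dict of numeric fields."""
    return INTERCEPT + sum(COEFFS[name] * car[name] for name in COEFFS)

# Hypothetical car, made up for illustration (not a row from the dataset).
example_car = {"Mileage": 20000, "Cylinder": 4, "Doors": 4,
               "Cruise": 1, "Sound": 1, "Leather": 0}
print(round(predict_price(example_car), 2))
```

A model fitted by Spark can be compared against this baseline by checking whether its intercept and weights land in the same order of magnitude.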

Since I don't know how to set the parameters just right, I ran the following brute-force approach:

from itertools import product

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.evaluation import RegressionMetrics

def f(n):
    return float(n)

if __name__ == "__main__":
    sc = SparkContext(appName="LinearRegressionExample")

    # CSV file format:
    # 0      1        2     3      4     5     6         7      8      9       10     11
    # Price, Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors, Cruise, Sound, Leather
    raw_data = sc.textFile('file:///home/ccastroh/training/pyspark/kuiper.csv')

    # Grabbing numerical values only (for now)
    data = raw_data \
        .map(lambda x: x.split(',')) \
        .map(lambda x: [f(x[0]), f(x[1]), f(x[6]), f(x[8]), f(x[9]), f(x[10]), f(x[11])])
    points = data.map(lambda x: LabeledPoint(x[0], x[1:])).cache()

    print("Num, Iterations, Step, MiniBatch, RegParam, RegType, Intercept?, Validation?, "
          "RMSE, R2, EXPLAINED VARIANCE, INTERCEPT, WEIGHTS...")

    # Every combination of the values below is tried (3960 in total).
    # The last list varies fastest, exactly as in nested for loops.
    grid = product(
        [10, 100, 1000],                                                # iterations
        [1, 1e-01, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06, 1e-07, 1e-08, 1e-09, 1e-10],  # step
        [0.2, 0.4, 0.6, 0.8, 1.0],                                      # miniBatchFraction
        [0.0, 0.1, 0.01, 0.001],                                        # regParam
        [None, 'l1', 'l2'],                                             # regType
        [True],                                                         # intercept
        [False, True])                                                  # validateData

    for i, (ite, stp, mini, regP, regT, intr, vald) in enumerate(grid, 1):
        message = ",".join(str(v) for v in (i, ite, stp, mini, regP, regT, intr, vald))

        model = LinearRegressionWithSGD.train(points, iterations=ite, step=stp,
            miniBatchFraction=mini, regParam=regP, regType=regT, intercept=intr,
            validateData=vald)

        # Score the model on the training data itself.
        predictions_observations = points \
            .map(lambda p: (float(model.predict(p.features)), p.label)).cache()
        metrics = RegressionMetrics(predictions_observations)
        message += "," + str(metrics.rootMeanSquaredError) \
            + "," + str(metrics.r2) \
            + "," + str(metrics.explainedVariance)

        message += "," + str(model.intercept)
        for weight in model.weights:
            message += "," + str(weight)

        print(message)
    sc.stop()

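As you can see, I basically ran 3960 different variations. In none of them did I get anything resembling the formula from the paper or from JMP. Here are some highlights:

In a lot of the runs I get NaN for the intercept and the weights.
The highest R^2 I get is -0.89. I didn't even know you could get a negative R^2. It turns out a negative value indicates that the chosen model fits worse than a horizontal line.
The lowest RMSE I get is 13600, which is way worse than the expected 7400.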
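The negative R^2 mentioned above can be reproduced with a few lines of plain Python. This is a sketch using the usual definition R^2 = 1 - SS_res / SS_tot; the function name `r_squared` and the toy numbers are made up for illustration, and it is only an assumption that Spark's RegressionMetrics computes r2 the same way.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [10.0, 20.0, 30.0]
horizontal_line = [20.0, 20.0, 20.0]   # always predicts the mean
bad_model = [30.0, 20.0, 10.0]         # slope has the wrong sign

print(r_squared(observed, horizontal_line))  # 0.0
print(r_squared(observed, bad_model))        # -3.0
```

Predicting the mean everywhere gives exactly 0, so any model whose squared error exceeds that of the mean ends up below 0.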

I also tried normalizing the values so that they fall in the [0,1] range, and that did not help either.
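For reference, here is a sketch of min-max normalization together with a toy illustration of how the step size interacts with feature scale. It uses plain full-batch gradient descent on synthetic, exactly linear data. It is not Spark's SGD implementation, and it is not a diagnosis of the runs above. The data, the function names and the step sizes are all assumptions chosen for the illustration.

```python
import math

def min_max_scale(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def gradient_descent(xs, ys, step, iterations):
    """Full-batch gradient descent for y ~ w * x + b under mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= step * grad_w
        b -= step * grad_b
    return w, b

# Synthetic data: price = 30000 - 0.2 * mileage, no noise.
mileage = [5000.0 * k for k in range(1, 9)]
price = [30000.0 - 0.2 * m for m in mileage]

# Raw mileage is in the tens of thousands, so this step overshoots and blows up.
raw_w, raw_b = gradient_descent(mileage, price, step=1e-3, iterations=200)
print(math.isfinite(raw_w), math.isfinite(raw_b))

# After scaling to [0, 1] a moderate step converges.
scaled_w, scaled_b = gradient_descent(min_max_scale(mileage), price,
                                      step=0.5, iterations=2000)
print(round(scaled_w, 3), round(scaled_b, 3))
```

On the raw feature the weights overflow to inf or NaN, while on the scaled feature the fit recovers the line in scaled units. Whether the same mechanism explains the NaN weights seen with LinearRegressionWithSGD would need to be checked against the actual data.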

Does anyone have any idea how to get a half-decent linear regression model? Am I missing something?

Source

2016-06-08 Carlos Andres Castro

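I had a similar problem. Using DecisionTree and RandomForest regression works fine, although they are not very good at producing continuous labels if you want a fairly accurate solution.

I then tested linear regression as you did, using multiple values for each parameter and also multiple datasets, and did not get any solution that was remotely close to the real values. I also tried using StandardScaler for feature scaling before training the model, but the results were not satisfactory either. :-(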
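One thing to keep in mind when scaling features first is that the fitted weights are then expressed in scaled units and cannot be compared directly with an equation in original units. The sketch below standardizes one column in plain Python and converts a model back to original units. It does not use Spark's StandardScaler, the function names are hypothetical, it uses the population standard deviation as an assumption, and the "fitted" model is made up.

```python
import math

def standardize(column):
    """Return (z-scores, mean, std) using the population standard deviation."""
    mean = sum(column) / len(column)
    var = sum((v - mean) * (v - mean) for v in column) / len(column)
    std = math.sqrt(var)
    return [(v - mean) / std for v in column], mean, std

def to_original_units(weights_std, intercept_std, means, stds):
    """Convert a model fitted on z-scores back to the raw feature scale."""
    weights = [w / s for w, s in zip(weights_std, stds)]
    intercept = intercept_std - sum(w * m for w, m in zip(weights, means))
    return weights, intercept

mileage = [5000.0, 15000.0, 25000.0, 35000.0]
z_scores, mu, sigma = standardize(mileage)

# Hypothetical model on the standardized feature: price = 20000 - 2000 * z
weights, intercept = to_original_units([-2000.0], 20000.0, [mu], [sigma])
print(weights, intercept)
```

Both forms give the same predictions; only the units of the coefficients differ.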

EDIT: Setting intercept to true may solve the problem.
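Why the intercept can matter so much is easy to see on a toy example. The sketch below uses closed-form least squares for a single feature rather than Spark's SGD, and the data and function names are made up for illustration.

```python
import math

def fit_with_intercept(xs, ys):
    """Ordinary least squares for y ~ w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) * (x - mean_x) for x in xs)
    w = cov / var
    return w, mean_y - w * mean_x

def fit_through_origin(xs, ys):
    """Least squares for y ~ w * x, with the intercept forced to zero."""
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return w, 0.0

def rmse(xs, ys, w, b):
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Toy data with a large offset: y = 100 + x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [101.0, 102.0, 103.0, 104.0]

rmse_with = rmse(xs, ys, *fit_with_intercept(xs, ys))
rmse_without = rmse(xs, ys, *fit_through_origin(xs, ys))
print(rmse_with, rmse_without)
```

When the labels have a large offset, as car prices do, a model forced through the origin has to absorb that offset in the slopes, which can leave it worse than simply predicting the mean.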

Source

2017-01-31 13:18:31 Anubis
