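Can't find the cause of a Spark LinearRegression error

I'm trying to run a very simple LinearRegression in PySpark on a housing dataset I found on Kaggle. The dataset has many columns, but to keep things as simple as possible I kept only two feature columns (after starting out with all of them), and I still can't get the model to train. This is what the DataFrame looks like just before it goes into the regression step: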
2016-09-07 17:12:08,804 root INFO [Row(price=78000.0, sqft_living=780.0, sqft_lot=16344.0, features=DenseVector([780.0, 16344.0])), Row(price=80000.0, sqft_living=430.0, sqft_lot=5050.0, features=DenseVector([430.0, 5050.0])), Row(price=81000.0, sqft_living=730.0, sqft_lot=9975.0, features=DenseVector([730.0, 9975.0])), Row(price=82000.0, sqft_living=860.0, sqft_lot=10426.0, features=DenseVector([860.0, 10426.0])), Row(price=84000.0, sqft_living=700.0, sqft_lot=20130.0, features=DenseVector([700.0, 20130.0])), Row(price=85000.0, sqft_living=830.0, sqft_lot=9000.0, features=DenseVector([830.0, 9000.0])), Row(price=85000.0, sqft_living=910.0, sqft_lot=9753.0, features=DenseVector([910.0, 9753.0])), Row(price=86500.0, sqft_living=840.0, sqft_lot=9480.0, features=DenseVector([840.0, 9480.0])), Row(price=89000.0, sqft_living=900.0, sqft_lot=4750.0, features=DenseVector([900.0, 4750.0])), Row(price=89950.0, sqft_living=570.0, sqft_lot=4080.0, features=DenseVector([570.0, 4080.0]))]
I train the model with the following code:
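from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# data and elastic_net_params are defined elsewhere (not shown)
standard_scaler = StandardScaler(inputCol='features',
                                 outputCol='scaled')
lr = LinearRegression(featuresCol=standard_scaler.getOutputCol(), labelCol='price', weightCol=None,
                      maxIter=100, tol=1e-4)
pipeline = Pipeline(stages=[standard_scaler, lr])
grid = (ParamGridBuilder()
        .baseOn({lr.labelCol: 'price'})
        .addGrid(lr.regParam, [0.1, 1.0])
        .addGrid(lr.elasticNetParam, elastic_net_params or [0.0, 1.0])
        .build())
ev = RegressionEvaluator(metricName="rmse", labelCol='price')
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=ev,
                    numFolds=5)
model = cv.fit(data).bestModel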
The error I get is:
2016-09-07 17:12:08,805 root INFO Training regression model...
2016-09-07 17:12:09,530 root ERROR An error occurred while calling o60.fit.
: java.lang.NullPointerException
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:164)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:70)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Any ideas?
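The error here is not caused by StandardScaler; that works fine for me (evidently your experience differs). The error turned out to come from the weight column: explicitly specifying 'weightCol=None' caused the error for me. I fixed it by creating a weight column filled with 1.0 and passing its name as weightCol (the value must be a float!).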
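A minimal sketch of that workaround, for illustration only. The helper names (drop_none, with_unit_weight, build_regression) and the column name 'weight' are hypothetical and not from the original post. The constant 1.0 weight column is the fix described above; filtering out None so that weightCol is never passed explicitly is an additional assumption that may or may not avoid the NullPointerException on a given Spark version.

```python
def drop_none(**params):
    """Return only the keyword arguments whose value is not None."""
    return {name: value for name, value in params.items() if value is not None}


def with_unit_weight(df, weight_col="weight"):
    """Add a constant float weight of 1.0 to every row (the workaround above)."""
    # Imported lazily so this module can be loaded without pyspark installed.
    from pyspark.sql.functions import lit
    # lit(1.0) yields a double column; an integer literal would not be a float.
    return df.withColumn(weight_col, lit(1.0))


def build_regression(features_col, label_col, weight_col=None):
    """Build a LinearRegression without ever passing weightCol=None explicitly."""
    # Requires pyspark and an active Spark session when called.
    from pyspark.ml.regression import LinearRegression
    kwargs = drop_none(featuresCol=features_col,
                       labelCol=label_col,
                       weightCol=weight_col,  # dropped when None (assumption)
                       maxIter=100,
                       tol=1e-4)
    return LinearRegression(**kwargs)
```

With these helpers, the question's code would change to data = with_unit_weight(data) followed by lr = build_regression('scaled', 'price', weight_col='weight'), leaving the rest of the pipeline as it was.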