2017-04-19

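Computing areaUnderROC for a logistic regression model in Spark

I have a logistic regression model in Spark.
I want to extract the probability of label = 1 from the output vector and compute areaUnderROC.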

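import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

val assembler = new VectorAssembler()
  .setInputCols(Array("A", "B", "C", "D", "E")) // for example
  .setOutputCol("features")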

val data = assembler.transform(logregdata) 

val Array(training,test) = data.randomSplit(Array(0.7,0.3),seed=12345) 
val training1 = training.select("label", "features") 
val test1 = test.select("label", "features") 

val lr = new LogisticRegression() 
val model = lr.fit(training1) 
val results = model.transform(test1) 
results.show() 

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(54,[13,31,34,35,...|[2.44227333947447...|[0.91999457581425...|       0.0|

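import org.apache.spark.mllib.evaluation.MulticlassMetrics

// My attempt, which does not work: "probability" is a Vector column, so .as[(Double, Double)] fails,
// and MulticlassMetrics has no areaUnderROC method (BinaryClassificationMetrics does).
val predictionAndLabels = results.select($"probability", $"label").as[(Double, Double)].rdd
val metrics = new MulticlassMetrics(predictionAndLabels)
val auROC = metrics.areaUnderROC()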

The probability looks like this: [0.9199945758142595,0.0800054241857405]
How can I extract the probability of label = 1 from the vector and compute the AUC?

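I don't understand the question. Isn't that what areaUnderROC computes by default? – jamborta

It is supposed to. In Python the same model returns AUC = 91%, while in Spark AUC = 73%. I want to check it manually. How do I extract the probability value from the vector? – Liron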

Answer

You can get the value by mapping over the rows of results. This returns a tuple of your original label and the predicted value of P(label=1):

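import org.apache.spark.ml.linalg.Vector // the ml Vector, not the mllib one
val predictions = results.map(row => (row.getAs[Double]("label"), row.getAs[Vector]("probability")(1))) // index 1 holds P(label=1)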
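To go from the extracted probability to an AUC, one option is BinaryClassificationMetrics from the RDD-based API, which takes (score, label) pairs. The following is a minimal sketch, not a definitive implementation: the helper name areaUnderRocFromProbability is hypothetical, and it assumes that results has a "label" column of type Double and a "probability" column holding an org.apache.spark.ml.linalg.Vector whose index 1 is P(label=1).

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: `results` is assumed to be the output of model.transform(test1).
def areaUnderRocFromProbability(results: DataFrame): Double = {
  val scoreAndLabels = results
    .select("probability", "label")
    .rdd
    // Take element 1 of the probability vector as the score, i.e. P(label=1).
    .map { case Row(prob: Vector, label: Double) => (prob(1), label) }
  // BinaryClassificationMetrics expects an RDD of (score, label) pairs.
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
}
```

Usage would be val auROC = areaUnderRocFromProbability(results). As a cross-check, new BinaryClassificationEvaluator().setMetricName("areaUnderROC").evaluate(results) from org.apache.spark.ml.evaluation computes the same metric directly from the rawPrediction column.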

I tried it and it doesn't work... I get this warning: org.apache.spark.sql.AnalysisException: Can't extract value from probability#5477; – Liron

Thanks. It seems to work. predictions: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: double]. But I can't display the results. I get this error: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector. How can I see the predictions I got? – Liron

I can't reproduce your error, but you could try specifying the exact type: 'import org.apache.spark.ml.linalg.DenseVector', and then 'val predictions = results.map(row => (row.getAs[Double]("label"), row.getAs[DenseVector]("probability")(1)))' – jamborta
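For checking the AUC by hand, as asked in the comments, the pairwise definition can be used: the AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below is plain Scala with no Spark dependency; manualAuc is a hypothetical name, and it assumes the (P(label=1), label) pairs have already been collected to the driver and that labels are exactly 0.0 or 1.0. It compares every pair, so it is only suitable for small samples.

```scala
// Manual AUC over (score, label) pairs, where score is P(label=1).
def manualAuc(scoreAndLabels: Seq[(Double, Double)]): Double = {
  val pos = scoreAndLabels.collect { case (score, label) if label == 1.0 => score }
  val neg = scoreAndLabels.collect { case (score, label) if label == 0.0 => score }
  require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
  // 1.0 when the positive outranks the negative, 0.5 on a tie, 0.0 otherwise.
  val wins = pos.flatMap(p => neg.map(n => if (p > n) 1.0 else if (p == n) 0.5 else 0.0))
  wins.sum / (pos.size.toLong * neg.size)
}
```

For example, manualAuc(Seq((0.9, 1.0), (0.4, 1.0), (0.5, 0.0), (0.1, 0.0))) gives 0.75, because three of the four positive/negative pairs are ordered correctly.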