2017-08-01

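Spark MLlib thresholds for F1 score

I am trying to find the threshold that gives my logistic regression the highest F1 score. However, when I wrote the following lines: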

val f1Score = metrics.fMeasureByThreshold
f1Score.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}

some strange values appear, for example:

Threshold: 2.0939996826644833, F-score: 0.285648784961027, Beta = 1 
Threshold: 2.093727854652065, F-score: 0.28604171441668574, Beta = 1 
Threshold: 2.0904571465313113, F-score: 0.2864344637946838, Beta = 1 
Threshold: 2.0884466833553468, F-score: 0.28682703321878583, Beta = 1 
Threshold: 2.0882666552407283, F-score: 0.2872194228126431, Beta = 1 
Threshold: 2.0835997800203447, F-score: 0.2876116326997939, Beta = 1 
Threshold: 2.077892816382506, F-score: 0.28800366300366304, Beta = 1 

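How can a threshold be greater than one? The same question applies to the negative values that show up further down in the console output.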

Answer

I made a mistake earlier, when converting my DataFrame to an RDD. Instead of writing:

val predictionAndLabels = predictions.select("probability", "labelIndex").rdd.map(x => (x(0).asInstanceOf[DenseVector](1), x(1).asInstanceOf[Double]))

I wrote:

val predictionAndLabels = predictions.select("rawPredictions", "labelIndex").rdd.map(x => (x(0).asInstanceOf[DenseVector](1), x(1).asInstanceOf[Double]))

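So the thresholds were computed on rawPredictions rather than on probabilities. Raw predictions are unbounded scores, not values in [0, 1], so thresholds above one or below zero make perfect sense.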
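
To make the difference concrete, here is a minimal sketch in plain Scala (no Spark dependency). It assumes, as I understand Spark's binary logistic regression to work, that the second entry of the raw prediction vector is the margin (log-odds) and that the probability is its sigmoid. The names ThresholdSketch, sigmoid and bestThreshold are made up for illustration; they are not part of Spark. The sample pairs are the first three lines of the output in the question.

```scala
// Sketch only: plain Scala, no Spark dependency. All names are illustrative.
object ThresholdSketch {
  // Logistic (sigmoid) function: maps an unbounded margin to a value in (0, 1).
  def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

  // Pick the (threshold, F-score) pair with the highest F-score.
  // Assumes the pairs were brought to the driver first,
  // e.g. with f1Score.collect().toSeq on the RDD from the question.
  def bestThreshold(pairs: Seq[(Double, Double)]): Option[(Double, Double)] =
    if (pairs.isEmpty) None else Some(pairs.maxBy(_._2))

  def main(args: Array[String]): Unit = {
    // First three (threshold, F-score) pairs from the output in the question.
    val observed = Seq(
      (2.0939996826644833, 0.285648784961027),
      (2.093727854652065, 0.28604171441668574),
      (2.0904571465313113, 0.2864344637946838)
    )

    // Show what each raw score would look like as a probability.
    observed.foreach { case (raw, f) =>
      println(f"raw score: $raw%.4f -> probability: ${sigmoid(raw)}%.4f, F-score: $f%.4f")
    }

    // Report the best pair among the sampled ones.
    bestThreshold(observed).foreach { case (t, f) =>
      println(s"Best threshold in this sample: $t (F-score: $f)")
    }
  }
}
```

Under that assumption, a raw score of about 2.09 corresponds to a probability just under 0.9, which is the kind of threshold one would expect once the "probability" column is selected. The same bestThreshold helper can then be applied to the collected output of fMeasureByThreshold to pick the threshold with the highest F1 score.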