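Prediction probabilities for multiclass classification with Spark LogisticRegressionWithLBFGS

I am using LogisticRegressionWithLBFGS() to train a model with multiple classes. According to the mllib documentation, clearThreshold() can be used only when the classification is binary. Is there a way to do something similar for multiclass classification, so that the model outputs the probability of each class for a given input?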
There are two ways to accomplish this. One is to write a method that takes over the responsibility of predictPoint in LogisticRegression.scala:
    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.{DenseVector, Vector}

    object ClassificationUtility {
      // Returns the predicted class together with a per-class score array.
      def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel): (Double, Array[Double]) = {
        require(dataMatrix.size == model.numFeatures)
        // Number of weights per class, including the intercept when one was fitted.
        val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
        val weightsArray: Array[Double] = model.weights match {
          case dv: DenseVector => dv.values
          case _ =>
            throw new IllegalArgumentException(
              s"weights only supports dense vector but got type ${model.weights.getClass}.")
        }
        // Class 0 starts as the best class with margin 0.0, so it wins unless some margin is positive.
        var bestClass = 0
        var maxMargin = 0.0
        val withBias = dataMatrix.size + 1 == dataWithBiasSize
        // Index 0 is never written in the loop below and stays 0.0.
        val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
        (0 until model.numClasses - 1).foreach { i =>
          var margin = 0.0
          dataMatrix.foreachActive { (index, value) =>
            if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
          }
          // Intercept is required to be added into margin.
          if (withBias) {
            margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
          }
          if (margin > maxMargin) {
            maxMargin = margin
            bestClass = i + 1
          }
          // Logistic function of this class's margin.
          classProbabilities(i + 1) = 1.0 / (1.0 + Math.exp(-margin))
        }
        (bestClass.toDouble, classProbabilities)
      }
    }
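Note that it differs only slightly from the original method: it also computes the logistic function of each class margin from the input features. It also defines a few vals and vars that are private in the original and are declared outside the method there. Finally, it stores the scores in an array and returns that array together with the best class. I call my method like this: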
    import org.apache.spark.mllib.regression.LabeledPoint

    // Compute raw scores on the test set.
    val predictionAndLabelsAndProbabilities = test.map {
      case LabeledPoint(label, features) =>
        val (prediction, probabilities) = ClassificationUtility.predictPoint(features, model)
        (prediction, label, probabilities)
    }
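The scores returned by predictPoint above are independent logistic values, so they need not sum to 1, and the entry for class 0 is left at 0.0. If normalized probabilities are needed, one option is a softmax over the margins with class 0 treated as a pivot whose margin is fixed at 0, which is my understanding of how MLlib parameterizes the multinomial model. The following is a minimal sketch under that assumption. PivotSoftmax and probabilities are hypothetical names, not part of Spark, and the margins array would have to be collected inside predictPoint.

```scala
object PivotSoftmax {
  // margins(i) is the margin of class i + 1; class 0 is assumed to be
  // the pivot class with an implicit margin of 0.0.
  def probabilities(margins: Array[Double]): Array[Double] = {
    // Prepend the pivot margin so that index k corresponds to class k.
    val all = 0.0 +: margins
    // Subtract the largest margin before exponentiating to avoid overflow.
    val maxMargin = all.max
    val exps = all.map(m => math.exp(m - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }
}

// Example with two non-pivot margins, giving three class probabilities.
val probs = PivotSoftmax.probabilities(Array(1.2, -0.7))
```

Here probs(0) belongs to the pivot class and the entries sum to 1 up to floating-point error.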
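However, the Spark contributors appear to be discouraging the use of MLlib in favor of ML. The ML logistic regression API does not currently support multiclass classification. I am now using OneVsRest, which acts as a wrapper for one-vs-all classification. You can obtain the raw scores by iterating over the models: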
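    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, OneVsRest}

    val lr = new LogisticRegression().setFitIntercept(true)
    val ovr = new OneVsRest()
    ovr.setClassifier(lr)
    val ovrModel = ovr.fit(training)

    // Save each underlying binary model under a path built from its uid and index.
    ovrModel.models.zipWithIndex.foreach {
      case (model: LogisticRegressionModel, i: Int) =>
        model.save(s"model-${model.uid}-$i")
    }

    // Load the saved models back; the uid in the path comes from the run that saved them.
    val model0 = LogisticRegressionModel.load("model-logreg_457c82141c06-0")
    val model1 = LogisticRegressionModel.load("model-logreg_457c82141c06-1")
    val model2 = LogisticRegressionModel.load("model-logreg_457c82141c06-2")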
Now that you have the individual models, you can obtain the probabilities by computing the sigmoid of each model's rawPrediction:
    def sigmoid(x: Double): Double = {
      1.0 / (1.0 + Math.exp(-x))
    }

    // values(1) is the raw score of the positive class in each binary model.
    val newPredictionAndLabels0 = model0.transform(newRescaledData)
      .select("prediction", "rawPrediction")
      .map(row => (row.getDouble(0),
        sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1))))
    newPredictionAndLabels0.foreach(println)

    val newPredictionAndLabels1 = model1.transform(newRescaledData)
      .select("prediction", "rawPrediction")
      .map(row => (row.getDouble(0),
        sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1))))
    newPredictionAndLabels1.foreach(println)

    val newPredictionAndLabels2 = model2.transform(newRescaledData)
      .select("prediction", "rawPrediction")
      .map(row => (row.getDouble(0),
        sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1))))
    newPredictionAndLabels2.foreach(println)
The sigmoid of each raw score is the probability that the corresponding model assigns to its own class.
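Because each sigmoid value comes from a separate binary model, the values across the three models generally do not sum to 1. If a single distribution over the classes is wanted, one simple heuristic (my own addition, not something OneVsRest does for you) is to divide each value by their sum. A minimal sketch, where OneVsRestScores and normalized are hypothetical names and the raw scores are assumed to have already been collected per row:

```scala
object OneVsRestScores {
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // rawScores(i) is the positive-class raw score from the i-th binary model.
  def normalized(rawScores: Array[Double]): Array[Double] = {
    val probs = rawScores.map(sigmoid)
    val total = probs.sum
    // If every sigmoid underflows to 0.0, fall back to a uniform distribution.
    if (total == 0.0) Array.fill(rawScores.length)(1.0 / rawScores.length)
    else probs.map(_ / total)
  }
}

// Example: raw scores from three binary models for one input row.
val dist = OneVsRestScores.normalized(Array(2.0, -1.0, 0.5))
val predicted = dist.indexOf(dist.max)
```

The class with the largest normalized value is the same one that has the largest raw score, so the normalization changes only the scale of the outputs, not the predicted class.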