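K-means in Spark (Scala): how to map cluster numbers back to customer IDs when the model is built from standardized data

The code below is what I use to build the model. The problem I am facing is mapping the cluster numbers back to the customer IDs. My model was trained on standardized data, but the data that carries the customer IDs is not standardized, and I cannot figure out how to map one back to the other.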
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.util.MLUtils
// Load the data for clustering; fields are separated by the Ctrl-A (0x01) character
val data = sc.textFile("hdfs://path/data_for_clus1")
// Skip the first column (presumably the customer ID) and keep the rest as features
val vectors = data.map(s => s.split('\u0001')).map(s => s.slice(1, s.size))
val parsedData = vectors.map(s => Vectors.dense(s.map(_.toDouble)))
// Standardize the features to zero mean and unit variance with StandardScaler
val features = parsedData
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
val scaledFeatures = scaler.transform(features)
val WSSEBuffer = ArrayBuffer[Double]()
// K-means
val numClusters = 20
val numIterations = 500
val clusters = KMeans.train(scaledFeatures, numClusters, numIterations)
val WSSSE = clusters.computeCost(scaledFeatures)
Using the model "clusters", I want to assign a cluster number to each customer ID in the table.
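One possible way to get that mapping is to never separate the ID from its features: keep (customerId, features) pairs, scale only the values, and call predict on the scaled values. The sketch below is only an illustration under assumptions that the question does not confirm: that the first column is the customer ID, that all remaining columns are numeric, and that the field separator is the Ctrl-A character. The helper name assignClusters is hypothetical.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Hypothetical helper (not from the question): returns (customerId, clusterNumber) pairs.
// Assumes column 0 is the customer ID and every other column is numeric.
def assignClusters(lines: RDD[String], k: Int, maxIterations: Int): RDD[(String, Int)] = {
  // Keep each customer ID next to its unscaled feature vector.
  val idAndFeatures = lines
    .map(_.split('\u0001'))
    .map(cols => (cols(0), Vectors.dense(cols.drop(1).map(_.toDouble))))
    .cache()

  // Fit the scaler on the features only, then scale the values while keeping the keys.
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(idAndFeatures.values)
  val idAndScaled = idAndFeatures.mapValues(v => scaler.transform(v)).cache()

  // Train on the scaled features and predict a cluster for each (id, scaledVector) pair.
  val model = KMeans.train(idAndScaled.values, k, maxIterations)
  idAndScaled.mapValues(v => model.predict(v))
}

// Usage in spark-shell, with the path and settings from the question:
// val idToCluster = assignClusters(sc.textFile("hdfs://path/data_for_clus1"), 20, 500)
```

If the code from the question is kept as it is, an alternative that should also work is data.map(_.split('\u0001')(0)).zip(clusters.predict(scaledFeatures)), because both RDDs are derived from the same input through map-only steps and therefore have the same partitioning and ordering. Carrying the ID as a key is less fragile if a filter or shuffle is added later.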