2016-06-10

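K-means in Spark (Scala): how to map cluster numbers back to customer IDs when the model is built from standardized data

The following code trains the model. The problem I am facing is mapping the cluster numbers back to customer IDs. The model was trained on standardized data, but the data that carries the customer IDs is not standardized, and I cannot figure out how to map back.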

import org.apache.spark.SparkContext._ 
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} 
import org.apache.spark.mllib.linalg.Vectors 
import scala.collection.mutable.ArrayBuffer 
import org.apache.spark.mllib.feature.StandardScaler 
import org.apache.spark.mllib.util.MLUtils 
// importing the data for clustering 
val data = sc.textFile("hdfs://path/data_for_clus1") 
// '\u0001' is the same character as the octal escape '\1', which newer Scala versions reject
val vectors = data.map(s => s.split('\u0001')).map(s => s.slice(1, s.size))
val parsedData = vectors.map(s => Vectors.dense(s.map(_.toDouble)))  

val dataAsArray = parsedData.map(_.toArray) 
// Using Standardscaler to standardize data 
val features = dataAsArray.map(a => Vectors.dense(a)) 
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features) 
val scaledFeatures = scaler.transform(features) 


val WSSEBuffer = ArrayBuffer[Double](); 
// K-means 
val numClusters = 20 
val numIterations = 500 
val clusters = KMeans.train(scaledFeatures, numClusters, numIterations) 
val WSSSE = clusters.computeCost(scaledFeatures) 

Using the model `clusters`, I want to assign a cluster number to each customer ID in the table.

Answer

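Parse your data into (customer ID, feature array) pairs:

// Assumption: the customer ID is the first column and the remaining columns are the raw (unscaled) features
val newdata = data.map(_.split('\u0001')).map(s => (s(0), s.slice(1, s.size).map(_.toDouble)))

Then apply the same scaler the model was trained with before predicting, because the model expects standardized input:

newdata.map { case (id, features) => (id, clusters.predict(scaler.transform(Vectors.dense(features)))) }

I am not sure whether this is the most efficient way.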
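
An end-to-end variant of this idea is sketched below: the customer ID stays paired with its feature vector through scaling, training, and prediction, so nothing has to be joined back afterwards. This is only a sketch under assumptions. `assignClusters` is a hypothetical helper name, the customer ID is assumed to be the first column of each record (the question does not state this), and `sc` is the SparkContext that spark-shell provides.

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper: takes (customerID, raw features) pairs and
// returns (customerID, cluster number) pairs.
def assignClusters(idAndFeatures: RDD[(String, Vector)],
                   numClusters: Int,
                   numIterations: Int): RDD[(String, Int)] = {
  // Fit the scaler on the feature vectors only; the IDs are not touched.
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(idAndFeatures.values)

  // Standardize each vector while keeping it attached to its ID.
  val idAndScaled = idAndFeatures.mapValues(v => scaler.transform(v)).cache()

  // Train on the standardized vectors, as in the question.
  val model = KMeans.train(idAndScaled.values, numClusters, numIterations)

  // Predict on the same standardized vectors; the ID rides along as the key.
  idAndScaled.mapValues(v => model.predict(v))
}

// Usage in spark-shell, assuming column 0 holds the customer ID:
// val idAndFeatures = sc.textFile("hdfs://path/data_for_clus1")
//   .map(_.split('\u0001'))
//   .map(cols => (cols(0), Vectors.dense(cols.drop(1).map(_.toDouble))))
// val idToCluster = assignClusters(idAndFeatures, 20, 500)
```

Keeping the ID as the key of a pair RDD avoids relying on row order between two separately built RDDs, at the cost of carrying one extra string per record.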
