
I am trying to define a function in Scala and call it repeatedly with Spark, but I get an error with RDD[Vector] in the function parameter. Here is my code:

import org.apache.spark.{SparkConf, SparkContext} 
import org.apache.spark.sql.SQLContext 
import org.apache.spark.ml.{Pipeline, PipelineModel} 
import org.apache.spark.ml.clustering.KMeans 
import org.apache.spark.mllib.linalg.Vectors 

import org.apache.spark.ml.feature.VectorIndexer 
import org.apache.spark.ml.feature.VectorAssembler 
import org.apache.spark.rdd._ 

val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("features")

val assembled = assembler.transform(df)

// measures the average distance to centroid, for a model built with a given k. 

def clusteringScore(data: RDD[Vector], k: Int) = {

  val kmeans = new KMeans()
    .setK(k)
    .setFeaturesCol("features")
    .setPredictionCol("prediction")

  val model = kmeans.fit(data)

  val WSSSE = model.computeCost(data)
  println(s"Within Set Sum of Squared Errors = $WSSSE")

}

(5 to 40 by 5).map(k => (k, clusteringScore(assembled, k))).foreach(println)

With this code, I get this error:

type Vector takes type parameters 

I don't know what this error means...

Answer


You didn't show your imports, but you are probably getting the Scala standard collection Vector (which takes type parameters, e.g. Vector[Int]) rather than the Spark MLlib Vector, which is a different type and which you should import like this:

import org.apache.spark.mllib.linalg.Vector
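
Note that this import only fixes the compile error. In the code shown, org.apache.spark.ml.clustering.KMeans.fit expects a DataFrame with a features column rather than an RDD[Vector], so the assembled DataFrame can be passed in directly. Below is a minimal sketch of how the function might look after both changes; it assumes a Spark version where ml.clustering.KMeansModel.computeCost is available (it was deprecated in 2.4 and removed in 3.0 in favor of ClusteringEvaluator):

import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans

// Sketch: take the assembled DataFrame instead of an RDD[Vector].
def clusteringScore(data: DataFrame, k: Int): Double = {
  val kmeans = new KMeans()
    .setK(k)
    .setFeaturesCol("features")
    .setPredictionCol("prediction")
  val model = kmeans.fit(data)
  // Within Set Sum of Squared Errors for this k
  model.computeCost(data)
}

(5 to 40 by 5).map(k => (k, clusteringScore(assembled, k))).foreach(println)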