
PCA: column must be of type org.apache.spark.ml.linalg.VectorUDT

I am trying to run the following code in pyspark (Spark 2.1.1):

from pyspark.ml.feature import PCA 

bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures") 
pcaModel = bankPCA.fit(bankDf)  
pcaResult = pcaModel.transform(bankDf).select("label", "pcaFeatures")  
pcaResult.show(truncate=False) 

But I get this error:

requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@... but was actually org.apache.spark.mllib.linalg.VectorUDT@... .

Answer


You can find the example here:

from pyspark.ml.feature import PCA 
from pyspark.ml.linalg import Vectors 

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),), 
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), 
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)] 
df = spark.createDataFrame(data, ["features"]) 

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures") 
model = pca.fit(df) 

... other code ... 

As you can see above, df is a DataFrame that contains Vectors.sparse() and Vectors.dense() values imported from pyspark.ml.linalg.
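For completeness, a typical way to finish such an example is to apply the fitted model and display the projected features; a minimal sketch reusing the model and df names above:

result = model.transform(df).select("pcaFeatures")  # project each row onto the top-3 principal components 
result.show(truncate=False)  # print the full pcaFeatures vectors without truncation 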

Perhaps your bankDf contains Vectors imported from pyspark.mllib.linalg instead.

So you have to build the vectors in your DataFrame using

from pyspark.ml.linalg import Vectors 

instead of:

from pyspark.mllib.linalg import Vectors 
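
Alternatively, if bankDf is produced by code you cannot easily change, you can convert the existing mllib-style vector column in place. A minimal sketch, assuming the column is named "features" (MLUtils.convertVectorColumnsToML is available from Spark 2.0 onward):

from pyspark.mllib.util import MLUtils 

# Convert the mllib-style "features" column to its pyspark.ml.linalg equivalent 
# so that ml estimators such as PCA accept it. 
bankDf = MLUtils.convertVectorColumnsToML(bankDf, "features") 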

You might also find this stackoverflow question interesting.
