How do I convert an ArrayType column to a DenseVector in a PySpark DataFrame?

I get the following error when trying to build an ML Pipeline:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type [email protected] but was actually ArrayType(DoubleType,true).' 

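The features column contains arrays of floating-point values. It sounds like I need to convert these to some type of vector (it is not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame, or do I need to convert to an RDD?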

Answer

You can use a UDF:

from pyspark.sql.functions import udf
list_to_vector_udf = udf(lambda vs: Vectors.dense(vs), VectorUDT())  # wraps each array in a DenseVector

Imports for Spark < 2.0:

from pyspark.mllib.linalg import Vectors, VectorUDT 

Imports for Spark 2.0+:

from pyspark.ml.linalg import Vectors, VectorUDT 

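Note that the classes from these two packages are not interchangeable, even though their implementations are identical.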

It is also possible to extract the individual features and assemble them with VectorAssembler. Assuming the input column is called features:

from pyspark.ml.feature import VectorAssembler 

n = ... # Size of features 

assembler = VectorAssembler(
    inputCols=["features[{0}]".format(i) for i in range(n)], 
    outputCol="features_vector") 

assembler.transform(df.select(
    "*", *(df["features"].getItem(i) for i in range(n)) 
)) 