2016-03-07 73 views
4

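Convert a vector column in a DataFrame back into an array column

I have a DataFrame with two columns, one of which (called dist) is a dense vector. How can I convert it back into a column of integer arrays?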

+---+-----+ 
| id| dist| 
+---+-----+ 
|1.0|[2.0]| 
|2.0|[4.0]| 
|3.0|[6.0]| 
|4.0|[8.0]| 
+---+-----+ 

I tried several variants of the following UDF, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a)}) 

val result = df.withColumn("dist", toDf4(df("dist"))).select("dist") 
+0

What is a "standard" column? –

+0

An array, for example – ulrich

+0

So you apparently want to merge all the columns into a single vector, right? –

Answers

5

I think the easiest way is to drop down to the RDD API, do the conversion there, and then go back to a DataFrame.

import org.apache.spark.mllib.linalg.DenseVector 
import org.apache.spark.sql.DataFrame 
import org.apache.spark.rdd.RDD 
import sqlContext.implicits._ // brings toDF into scope (pre-imported in spark-shell)

// The original data. 
val input: DataFrame = 
    sc.parallelize(1 to 4) 
    .map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2))) 
    .toDF("id", "dist") 

// Turn it into an RDD for manipulation. 
val inputRDD: RDD[(Double, DenseVector)] = 
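    // Going through .rdd works on Spark 1.x and also on 2.x, where DataFrame.map returns a Dataset.
    input.rdd.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))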

// Change the DenseVector into an integer array. 
val outputRDD: RDD[(Double, Array[Int])] = 
    inputRDD.mapValues(_.toArray.map(_.toInt)) 

// Go back to a DataFrame. 
val output = outputRDD.toDF("id", "dist") 
output.show 

You get:

+---+----+ 
| id|dist| 
+---+----+ 
|1.0| [2]| 
|2.0| [4]| 
|3.0| [6]| 
|4.0| [8]| 
+---+----+ 
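
If you would rather stay in the DataFrame API, a UDF that returns an array should also work. The UDF in the question is declared to return Int but hands back the vector itself, which is one likely source of the type mismatch. The following is a minimal sketch, not a definitive implementation. It assumes Spark 2.x or later, where DataFrame vector columns normally use org.apache.spark.ml.linalg (on Spark 1.x the import would be org.apache.spark.mllib.linalg). The local SparkSession and the name vectorToIntArray are illustrative and not from the original post.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("vector-to-array").getOrCreate()
import spark.implicits._

// Same shape as the data in the question: a double id and a one-element vector.
val df = (1 to 4).map(i => (i.toDouble, Vectors.dense(i.toDouble * 2))).toDF("id", "dist")

// Accept the Vector trait so dense and sparse vectors are both handled,
// and return Array[Int] so the resulting column has type array<int>.
val vectorToIntArray = udf((v: Vector) => v.toArray.map(_.toInt))

val result = df.withColumn("dist", vectorToIntArray(col("dist")))
result.printSchema()
result.show()
```

Note that toInt truncates, so a value such as 2.9 becomes 2. Round first if that matters for your data.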
4

In Spark 2.0, you can do something like this:

import org.apache.spark.mllib.linalg.DenseVector 
import org.apache.spark.sql.functions.udf 

val vectorHead = udf{ x:DenseVector => x(0) } 
df.withColumn("firstValue", vectorHead(df("vectorColumn"))) 
+0

As @pwb2103 mentioned, the first line should be import org.apache.spark.ml.linalg.DenseVector –
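
For readers on newer versions: Spark 3.0 added a built-in vector_to_array function in org.apache.spark.ml.functions, which avoids writing a UDF at all. A minimal sketch, assuming Spark 3.0 or later; the local SparkSession and the sample data mirror the question and are here only for illustration.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("vector-to-array-builtin").getOrCreate()
import spark.implicits._

val df = (1 to 4).map(i => (i.toDouble, Vectors.dense(i.toDouble * 2))).toDF("id", "dist")

// vector_to_array yields array<double>; the cast turns it into array<int>.
val converted = df.withColumn("dist", vector_to_array(col("dist")).cast("array<int>"))
converted.printSchema()
converted.show()
```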

6

I struggled for a while to get the answer from @ThomasLuechtefeld working, and kept running into this very frustrating error:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features_scaled)' due to data type mismatch: argument 1 requires vector type, however, '`features_scaled`' is of vector type. 

It turned out that I needed to import DenseVector from the ml package rather than the mllib package.

So this works for me:

import org.apache.spark.ml.linalg.DenseVector 
import org.apache.spark.sql.functions._ 

val vectorToColumn = udf{ (x:DenseVector, index: Int) => x(index) } 
myDataframe.withColumn("clusters_scaled",vectorToColumn(col("features_scaled"),lit(0))) 

Yes, the only difference is the first line. This should definitely be a comment, but I don't have the reputation. Sorry!
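
The confusing "requires vector type, however ... is of vector type" message comes up because Spark 2.x ships two separate vector types: one in org.apache.spark.mllib.linalg and one in org.apache.spark.ml.linalg. If a column was built with the older mllib type and you cannot change how it is produced, MLUtils.convertVectorColumnsToML can convert it. A minimal sketch, assuming Spark 2.0 or later; the sample data, the local SparkSession, and the name elementAt are illustrative only.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("ml-vs-mllib").getOrCreate()
import spark.implicits._

// A column that holds the older mllib vector type.
val oldDf = Seq((1.0, OldVectors.dense(2.0, 3.0))).toDF("id", "features_scaled")

// Convert the named column to the newer ml vector type.
val newDf = MLUtils.convertVectorColumnsToML(oldDf, "features_scaled")

// A UDF typed against the ml Vector trait now matches the column type.
val elementAt = udf((v: Vector, index: Int) => v(index))

val out = newDf.withColumn("clusters_scaled", elementAt(col("features_scaled"), lit(0)))
out.show()
```

Typing the UDF against the Vector trait instead of DenseVector is a deliberate choice here: it keeps the UDF usable if some rows hold sparse vectors.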