2017-08-06

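Spark: fill DataFrame with Vector for null

I have a DataFrame that contains feature vectors created by VectorAssembler, and it also contains null values. I now want to replace the null values with a vector: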

import org.apache.spark.ml.linalg.Vectors

val nil = Vectors.dense(Array.fill(20)(1.0)) // 20-element vector of ones

df.na.fill(nil) // does not work: na.fill has no overload that accepts a vector

What is the correct way to do this?

Edit: Thanks to the answer, I found the following solution:

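import org.apache.spark.sql.functions.{broadcast, coalesce}
import sc.implicits._ // `sc` must be a SparkSession (or SQLContext) here, not a SparkContext

val nil = Vectors.dense(Array.fill(20)(1.0)) // 20-element vector of ones

// Single-row DataFrame holding the replacement vector
val fill = Seq(Tuple1(nil)).toDF("replacement")

// Columns to fill: every column whose name contains "1"
val dates = data.schema.fieldNames.filter(_.contains("1"))

// `data` has to be declared as a var for the reassignments below
data = data.crossJoin(broadcast(fill))
for (e <- dates) {
  data = data.withColumn(e, coalesce(data.col(e), $"replacement"))
}
data = data.drop("replacement")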

Answer

If the nulls come from some additional rows that were added, you can cross join with a replacement value and coalesce:

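import org.apache.spark.sql.functions._

// Sample data: row 1 has a null vector, row 2 has a value
val df = Seq((1, None), (2, Some(nil))).toDF("id", "vector")
// Single-row DataFrame holding the replacement vector
val fill = Seq(Tuple1(nil)).toDF("replacement")

// The cross join attaches the replacement to every row; coalesce keeps non-null vectors
df.crossJoin(broadcast(fill)).withColumn("vector", coalesce($"vector", $"replacement")).drop("replacement")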
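
The cross join and coalesce steps can also be wrapped in a small helper that handles several columns at once. What follows is only a minimal sketch, not part of Spark's API: the object `FillNullVectors`, the function `fillVectors`, and the temporary column name `__replacement` are all hypothetical names chosen for illustration. It assumes Spark 2.1 or later (for `crossJoin`) and vectors from `org.apache.spark.ml.linalg`.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, coalesce, col}

object FillNullVectors {
  // Hypothetical helper (not a Spark API): replaces nulls in the given
  // vector columns with `default`, using a broadcast cross join + coalesce.
  def fillVectors(df: DataFrame, columns: Seq[String], default: Vector)
                 (implicit spark: SparkSession): DataFrame = {
    import spark.implicits._
    val tmp = "__replacement" // assumed not to clash with an existing column
    val fill = Seq(Tuple1(default)).toDF(tmp)
    columns
      .foldLeft(df.crossJoin(broadcast(fill))) { (acc, c) =>
        // Keep the existing vector when present, otherwise use the default
        acc.withColumn(c, coalesce(col(c), col(tmp)))
      }
      .drop(tmp)
  }

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("fill-null-vectors")
      .getOrCreate()
    import spark.implicits._

    val ones = Vectors.dense(Array.fill(3)(1.0))
    // Row 1 has a null vector, row 2 has a value
    val df = Seq[(Int, Option[Vector])](
      (1, None),
      (2, Some(Vectors.dense(0.5, 0.25, 0.0)))
    ).toDF("id", "vector")

    fillVectors(df, Seq("vector"), ones).show(false)
    spark.stop()
  }
}
```

Folding over the column names avoids reassigning a `var`, and because the replacement DataFrame has a single row, broadcasting it keeps the cross join cheap.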