
Fillna PySpark DataFrame with numpy array error

Below is an example of my Spark DataFrame, with its printSchema underneath:

+--------------------+---+------+------+--------------------+ 
|   device_id|age|gender| group|    apps| 
+--------------------+---+------+------+--------------------+ 
|-9073325454084204615| 24|  M|M23-26|    null| 
|-8965335561582270637| 28|  F|F27-28|[1.0,1.0,1.0,1.0,...| 
|-8958861370644389191| 21|  M| M22-|[4.0,0.0,0.0,0.0,...| 
|-8956021912595401048| 21|  M| M22-|    null| 
|-8910497777165914301| 25|  F|F24-26|    null| 
+--------------------+---+------+------+--------------------+ 
only showing top 5 rows 

root 
|-- device_id: long (nullable = true) 
|-- age: integer (nullable = true) 
|-- gender: string (nullable = true) 
|-- group: string (nullable = true) 
|-- apps: vector (nullable = true) 

I want to fill the nulls in the 'apps' column with np.zeros(19237). But when I execute

df.fillna({'apps': np.zeros(19237)}) 

I get an error

Py4JJavaError: An error occurred while calling o562.fill. 
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList 

Or, if I try

df.fillna({'apps': DenseVector(np.zeros(19237))}) 

I get an error

AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id' 

Any ideas?

Answer


DataFrameNaFunctions supports only a subset of native (non-UDT) types, so you need a UDF here.

from pyspark.sql.functions import coalesce, col, udf 
from pyspark.ml.linalg import Vectors, VectorUDT 

def zeros(n): 
    """Return a Column of all-zero vectors of length n.""" 
    def zeros_(): 
        return Vectors.sparse(n, {}) 
    return udf(zeros_, VectorUDT())() 
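
Note that Vectors.sparse(n, {}) builds an all-zero vector without storing n explicit entries, which is cheaper than a dense vector when n is as large as 19237.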

Example usage:

df = spark.createDataFrame(
    [(1, Vectors.dense([1, 2, 3])), (2, None)], 
    ("device_id", "apps")) 

df.withColumn("apps", coalesce(col("apps"), zeros(3))).show() 
+---------+-------------+ 
|device_id|   apps| 
+---------+-------------+ 
|  1|[1.0,2.0,3.0]| 
|  2| (3,[],[])| 
+---------+-------------+
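
Applied to your DataFrame, a minimal sketch (assuming every non-null apps vector has length 19237) would be:

df.withColumn("apps", coalesce(col("apps"), zeros(19237))).show(5) 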