
Fillna PySpark DataFrame with numpy array error

Below is an example of my Spark DataFrame, with its printSchema underneath:

+--------------------+---+------+------+--------------------+ 
|   device_id|age|gender| group|    apps| 
+--------------------+---+------+------+--------------------+ 
|-9073325454084204615| 24|  M|M23-26|    null| 
|-8965335561582270637| 28|  F|F27-28|[1.0,1.0,1.0,1.0,...| 
|-8958861370644389191| 21|  M| M22-|[4.0,0.0,0.0,0.0,...| 
|-8956021912595401048| 21|  M| M22-|    null| 
|-8910497777165914301| 25|  F|F24-26|    null| 
+--------------------+---+------+------+--------------------+ 
only showing top 5 rows 

root 
|-- device_id: long (nullable = true) 
|-- age: integer (nullable = true) 
|-- gender: string (nullable = true) 
|-- group: string (nullable = true) 
|-- apps: vector (nullable = true) 

I want to fill the nulls in the 'apps' column with np.zeros(19237). But when I execute

df.fillna({'apps': np.zeros(19237)}) 

I get an error

Py4JJavaError: An error occurred while calling o562.fill. 
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList 

Or, if I try

df.fillna({'apps': DenseVector(np.zeros(19237))}) 

I get an error

AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id' 

Any ideas?

Answer


DataFrameNaFunctions supports only a subset of native (non-UDT) types, so you need a UDF here.

from pyspark.sql.functions import coalesce, col, udf 
from pyspark.ml.linalg import Vectors, VectorUDT 

def zeros(n): 
    """Return a Column of all-zero vectors of length n.""" 
    def zeros_(): 
        return Vectors.sparse(n, {}) 
    return udf(zeros_, VectorUDT())() 
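
Note that Vectors.sparse(n, {}) builds an all-zero vector without storing n explicit entries, which is cheaper than a dense vector when n is as large as 19237.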

Example usage:

df = spark.createDataFrame(
    [(1, Vectors.dense([1, 2, 3])), (2, None)], 
    ("device_id", "apps")) 

df.withColumn("apps", coalesce(col("apps"), zeros(3))).show() 
+---------+-------------+ 
|device_id|   apps| 
+---------+-------------+ 
|  1|[1.0,2.0,3.0]| 
|  2| (3,[],[])| 
+---------+-------------+
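
Applied to your DataFrame, a minimal sketch (assuming every non-null apps vector has length 19237) would be:

df.withColumn("apps", coalesce(col("apps"), zeros(19237))).show(5) 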