2017-09-26 118 views

Answers

2

You have to replace the column with one that uses the new schema. ArrayType takes two parameters: elementType and containsNull.

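from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()  # reuses the existing session, e.g. in the pyspark shell

x = [("a", ["b", "c", "d", "e"]), ("g", ["h", "h", "d", "e"])]
# Original schema: the array elements are declared non-nullable.
schema = StructType([StructField("key", StringType(), nullable=True),
                     StructField("values", ArrayType(StringType(), containsNull=False))])

df = spark.createDataFrame(x, schema=schema)
df.printSchema()

# Identity UDF whose declared return type allows null elements.
new_schema = ArrayType(StringType(), containsNull=True)
udf_foo = udf(lambda x: x, new_schema)
df.withColumn("values", udf_foo("values")).printSchema()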



root 
|-- key: string (nullable = true) 
|-- values: array (nullable = true) 
| |-- element: string (containsNull = false) 

root 
|-- key: string (nullable = true) 
|-- values: array (nullable = true) 
| |-- element: string (containsNull = true) 

Thanks @ashwinds, it helped. – user2763088