PySpark schema not recognized

I am trying to load a CSV file using this schema:

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", ArrayType(StringType()), True)
])

dataFile = 'mycsv.csv' 

df = (sqlContext.read
      .option("mode", "DROPMALFORMED")
      .schema(sch)
      .option("delimiter", ",")
      .option("charset", "UTF-8")
      .load(dataFile, format='com.databricks.spark.csv', header='true', inferSchema='false'))

mycsv.csv contains:

id , words 
a , test here

I expect the DataFrame to contain [Row(id='a', words=['test', 'here'])]

but instead it is empty: df.collect() returns []

Is my schema defined correctly?

Source

2017-04-21 blue-sky

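Well, evidently your words column is not an Array type; in the CSV it is only a plain string, so its type is StringType(). And because you enabled DROPMALFORMED, the record is dropped because it does not match the Array schema. Try the schema below and it should work fine: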

sch = StructType([ 
    StructField("id", StringType(), True), 
    StructField("words", StringType(), True) 
])

Edit: if you really want the second column as an array/list of words, do this:

from pyspark.sql.functions import split 
df.select(df.id,split(df.words," ").alias('words')).show()

This outputs:

+---+--------------+
| id|         words|
+---+--------------+
| a |[, test, here]|
+---+--------------+
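The empty first element in that array comes from the space after the comma in the CSV: the field value is " test here", and splitting on a single space produces an empty string before the first word. A minimal plain-Python sketch of the same behavior, with stripping as one possible fix (the helper name `split_words` is hypothetical; in PySpark, `trim` from `pyspark.sql.functions` could be applied to the column before `split`):

```python
def split_words(field):
    """Strip surrounding whitespace, then split on single spaces."""
    return field.strip().split(" ")


raw_field = " test here"          # the words value as read from the CSV

naive = raw_field.split(" ")      # mirrors split(df.words, " ")
cleaned = split_words(raw_field)  # mirrors split(trim(df.words), " ")

print(naive)    # ['', 'test', 'here']
print(cleaned)  # ['test', 'here']
```

The same leading or trailing space also affects the id column, which is why the table shows "a " rather than "a".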

Source

2017-04-21 14:25:38 Pushkr
