2017-06-15

Pyspark 2.1.1 on Spark: StructFields in a StructType are always nullable

I am creating a StructType from multiple StructFields. The names and data types come through fine, but no matter whether nullable is set to null or to False in each StructField, the resulting schema reports nullable as True for every StructField.

Can anyone explain why? Thanks!

from pyspark.sql import SparkSession 
from pyspark.sql.types import StructType, StringType, FloatType, TimestampType 

sparkSession = SparkSession.builder \ 
    .master("local") \ 
    .appName("SparkSession") \ 
    .getOrCreate() 


dfStruct = StructType().add("date", TimestampType(), False) 
dfStruct.add("open", FloatType(), False) 
dfStruct.add("high", FloatType(), False) 
dfStruct.add("low", FloatType(), False) 
dfStruct.add("close", FloatType(), False) 
dfStruct.add("ticker", StringType(), False) 

#print elements of StructType -- each field reports nullable as False
for d in dfStruct: print(d)

#data looks like this: 
#date,open,high,low,close,ticker 
# 2014-10-14 23:20:32,7.14,9.07,0.0,7.11,ARAY 
# 2014-10-14 23:20:36,9.74,10.72,6.38,9.25,ARC 
# 2014-10-14 23:20:38,31.38,37.0,28.0,30.94,ARCB 
# 2014-10-14 23:20:44,15.39,17.37,15.35,15.3,ARCC 
# 2014-10-14 23:20:49,5.59,6.5,5.31,5.48,ARCO 

#read csv file and apply dfStruct as the schema 
df = sparkSession.read.csv(path = "/<path tofile>/stock_data.csv",
                           schema = dfStruct,
                           sep = ",",
                           ignoreLeadingWhiteSpace = True,
                           ignoreTrailingWhiteSpace = True)

#reports nullable as True! 
df.printSchema() 

Answer


This is a known issue in Spark, and there is currently an open pull request against Spark that aims to fix it. If you really need your fields to be non-nullable, try:

#read csv file and apply dfStruct as the schema 
df = sparkSession.read.csv(path = "/<path tofile>/stock_data.csv",
                           schema = dfStruct,
                           sep = ",",
                           ignoreLeadingWhiteSpace = True,
                           ignoreTrailingWhiteSpace = True
                           ).rdd.toDF(dfStruct)

Nice one, Steven - that works! – learnedOnPascal


I'm not sure how fast such a conversion is, so I wouldn't use it on terabytes of data, but if you're just reading csv files it should work fine. –
