I'm creating a StructType with several StructFields - the name and dataType seem to work fine, but no matter whether I set nullable to False (or leave it empty) in each StructField, the resulting schema reports nullable as True for every StructField. Pyspark 2.1.1 on Spark, StructFields in a StructType are always nullable.
Can anyone explain why? Thanks!
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, FloatType, TimestampType
sparkSession = SparkSession.builder \
.master("local") \
.appName("SparkSession") \
.getOrCreate()
dfStruct = StructType().add("date", TimestampType(), False)
dfStruct.add("open", FloatType(), False)
dfStruct.add("high", FloatType(), False)
dfStruct.add("low", FloatType(), False)
dfStruct.add("close", FloatType(), False)
dfStruct.add("ticker", StringType(), False)
#print elements of StructType -- reports nullable as False
for d in dfStruct: print(d)
#data looks like this:
#date,open,high,low,close,ticker
# 2014-10-14 23:20:32,7.14,9.07,0.0,7.11,ARAY
# 2014-10-14 23:20:36,9.74,10.72,6.38,9.25,ARC
# 2014-10-14 23:20:38,31.38,37.0,28.0,30.94,ARCB
# 2014-10-14 23:20:44,15.39,17.37,15.35,15.3,ARCC
# 2014-10-14 23:20:49,5.59,6.5,5.31,5.48,ARCO
#read csv file and apply dfStruct as the schema
df = sparkSession.read.csv(path = "/<path tofile>/stock_data.csv",
                           schema = dfStruct,
                           sep = ",",
                           ignoreLeadingWhiteSpace = True,
                           ignoreTrailingWhiteSpace = True
                           )
#reports nullable as True!
df.printSchema()
Very nice, Steven - that one works! – learnedOnPascal
I'm not sure how fast such a conversion is, so I wouldn't use it for terabytes of data, but if you're just reading csv files it should work fine. –