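PySpark: length of fields (%d)" % (len(obj), len(dataType.fields))
I am fairly new to PySpark, and I am trying to track down a specific error that has been bugging me and that I have not been able to fix.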
lines = sc.textFile('train.csv')
from pyspark.sql.types import *
The train.csv file is stored here: . It is fairly large.
The first row contains the column names.
I set the schema from the first row of the data.
fields = [StructField(field_name, StringType(), True) for field_name in lines.first().split(',')]  # I am setting the schema here
schema = StructType(fields)
mstr_header = lines.filter(lambda l: "Country" in l)  # I have looked at the first row of the data and want to remove it. Only the first row contains 'Country'
linesNoHeader = lines.subtract(mstr_header)
lines_df = linesNoHeader.map(lambda x: x.split(",")).toDF(schema)  # make a DataFrame
When I run lines_df.count(), PySpark throws the following error:
length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (18) does not match with length of fields (17)
I cannot figure out where I am going wrong. Apologies for the large data file.
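One way to narrow this down is to check whether some rows split into more fields than the header does. The sketch below is only an illustration: it assumes the extra field comes from a comma inside a quoted value, which has not been confirmed for train.csv. The sample rows and the helper name field_counts are hypothetical. It compares a plain split(",") with Python's csv module on the same lines.

```python
import csv

def field_counts(lines):
    """Return (naive_count, csv_count) per line.

    naive_count uses str.split(","), as in the code above;
    csv_count uses csv.reader, which respects quoted fields.
    Assumes no quoted field spans more than one line.
    """
    parsed = csv.reader(lines)
    return [(len(line.split(",")), len(row)) for line, row in zip(lines, parsed)]

# Hypothetical sample rows, not taken from train.csv
sample = [
    'Country,City,Sales',
    'US,"Washington, D.C.",10',
]

counts = field_counts(sample)
for line, (naive, parsed) in zip(sample, counts):
    print(naive, parsed, line)
```

If the two counts differ for a row, a plain split is producing more fields than the schema has. On the RDD itself, something like linesNoHeader.map(lambda x: len(x.split(","))).distinct().collect() would show which field counts actually occur in the data.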