Unable to fix PySpark error: "length of fields (%d)" % (len(obj), len(dataType.fields))

I am fairly new to PySpark, and I am trying to track down a specific error that has been bugging me.

lines = sc.textFile('train.csv') 
from pyspark.sql.types import *

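The train.csv file is stored here. It is fairly large.

The first row contains the column information, so I set the schema from the first row of the data.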

fields = [StructField(field_name, StringType(), True) for field_name in lines.first().split(',')] # I am setting the schema here 
schema = StructType(fields) 
mstr_header = lines.filter(lambda l: "Country" in l)  # Select the header row so it can be removed; only the first row contains 'Country'
linesNoHeader = lines.subtract(mstr_header) 

lines_df = linesNoHeader.map(lambda x: x.split(",")).toDF(schema) #make a dataframe

When I run lines_df.count(), PySpark throws an error saying:

length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (18) does not match with length of fields (17)

I cannot figure out where I am going wrong. Apologies for the large data file.

Source

2017-09-18 kasa

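The problem is more or less clear from the error message: in your file, the first rows contain 17 fields, but quite a few rows appear to contain more. They do not actually have more fields; they have more commas than you expect when you simply use split.

You can easily check this with a shell command: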

cat train.csv | awk -F ',' '{print NF-1, NR}' | grep -v "^16 "

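(This counts the number of , characters on each line and discards the lines that have exactly 16, that is, the ones with 17 fields.)

The first example is line 16807, which looks like this: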
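The same check can be sketched in plain Python if awk is not at hand. This is a minimal sketch: the helper name lines_with_unexpected_commas and the tiny in-memory sample are made up for illustration, and the real file would be opened with open('train.csv') and checked with expected_commas=16.

```python
import io


def lines_with_unexpected_commas(lines, expected_commas=16):
    """Yield (line_number, comma_count) for lines whose raw comma count differs."""
    for line_number, line in enumerate(lines, start=1):
        comma_count = line.count(",")
        if comma_count != expected_commas:
            yield line_number, comma_count


# Made-up stand-in for train.csv: three columns, so 2 commas are expected per line.
sample = io.StringIO('a,b,c\n1,2,3\n4,"x,y",6\n')
bad = list(lines_with_unexpected_commas(sample, expected_commas=2))
print(bad)  # [(3, 3)]: line 3 has an extra comma inside the quoted field
```

Like the shell command, this counts raw commas and ignores quoting on purpose, so it flags exactly the lines that a plain split would break on.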

16805,PH,-1.0,A,2017-08-21 00:03:13,Generic,android_webkit,Android," http://supertraff.com/l/32398308f0e2f715d41?vId=bmconv_20170820203313_6b3e1d81_172f_4fd6_8725_f8cc7522896d&sub=20386192139,8753761,5,3177&source=Unknown&test=a ",112.198.101.161,False,,0.0,282,,4901.0,0.0

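Look at the query string of that URL: it contains several , characters. Fortunately, your CSV follows the standard convention of putting double quotes at the start and end of a field so that the commas inside it are skipped. You can therefore read it with, for example, the Databricks CSV reader.

See this answer for an example, or check the documentation.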
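To illustrate the difference between a plain split and a quote-aware parser, here is a small sketch using Python's standard csv module. The row is a shortened, made-up stand-in for the real line, not data from train.csv.

```python
import csv

# Made-up row with three logical fields; the middle one is quoted and contains commas.
row = '16805,"http://example.com/l?sub=1,2,3&test=a",False'

naive_fields = row.split(",")            # splits inside the quoted URL as well
parsed_fields = next(csv.reader([row]))  # respects the double quotes

print(len(naive_fields))   # 5
print(len(parsed_fields))  # 3
print(parsed_fields[1])    # http://example.com/l?sub=1,2,3&test=a
```

If you prefer to stay with the RDD approach, replacing x.split(",") with next(csv.reader([x])) inside the map is one possible option, assuming no quoted field spans more than one line.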
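As a rough sketch of the reader-based approach, assuming Spark 2.0 or later, where a CSV reader is built in and the separate Databricks package is not required, something along these lines should work. The function name read_train_csv is hypothetical, and spark is assumed to be an existing SparkSession.

```python
def read_train_csv(spark, path="train.csv"):
    """Read a quoted CSV file with Spark's built-in DataFrame reader."""
    return spark.read.csv(
        path,
        header=True,        # take column names from the first row instead of filtering it out
        quote='"',          # commas inside double-quoted fields are not treated as separators
        inferSchema=False,  # keep every column as a string, as in the question's schema
    )
```

With a session created via SparkSession.builder.getOrCreate(), calling read_train_csv(spark).count() would replace the manual split, schema, and header-removal steps from the question.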

Source

2017-09-19 09:56:08 lrnzcig
