
PySpark serialization EOFError

I am reading a CSV into a Spark DataFrame and performing machine learning operations on it. I keep getting a Python serialization EOFError. Any idea why? I thought it might be a memory issue, i.e. the file exceeding available memory, but drastically reducing the size of the DataFrame did not prevent the EOF error.

The toy code and the error are below.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import RandomForestClassifier

# set up the Spark context
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read in the 500 MB csv as a DataFrame (uses the spark-csv package)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('myfile.csv')

# get the DataFrame into machine learning format
r_formula = RFormula(formula="outcome ~ .")
mldf = r_formula.fit(df).transform(df)

# fit a random forest model
rf = RandomForestClassifier(numTrees=3, maxDepth=2)
model = rf.fit(mldf)
result = model.transform(mldf).head()

Running the code above repeatedly on a single node throws the error below, even if the size of the DataFrame is reduced before fitting the model (e.g. tinydf = df.sample(False, 0.00001)):

Traceback (most recent call last):
    File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
    File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
    File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
      if read_int(infile) == SpecialLengths.END_OF_STREAM:
    File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
      raise EOFError
EOFError
Could you give Spark 2.1.0 (just released) a try? –

Could you also create another DataFrame (manually) from 'df' and start over? –

Could you put the csv file you are trying to read somewhere accessible, so that we can take a look? –
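As a side note on the second comment above, here is a minimal sketch of building a small DataFrame by hand and rerunning the same pipeline, to check whether the EOFError still reproduces. It assumes the sqlContext, r_formula, and rf objects from the question's code, and the column names are hypothetical:

from pyspark.sql import Row

# two hypothetical rows matching the "outcome ~ ." formula above
rows = [Row(outcome=0.0, x1=1.0, x2=2.0),
        Row(outcome=1.0, x1=3.0, x2=4.0)]
manual_df = sqlContext.createDataFrame(rows)

# run the identical RFormula + random forest pipeline on the hand-built data
mldf2 = r_formula.fit(manual_df).transform(manual_df)
model2 = rf.fit(mldf2)
print(model2.transform(mldf2).head())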

Answer


The error occurs in the PySpark read_int function. Its code, from the Spark site, is as follows:

import struct

def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]

This shows that read_int tries to read 4 bytes from the stream, and raises an EOFError if it reads 0 bytes instead; in other words, the worker's input stream ended before a complete 4-byte integer arrived. The Python documentation is here.
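To make the failure concrete, here is a small standalone demonstration (plain Python, no Spark, using the read_int function above with an in-memory stream) of exactly when the EOFError fires:

import struct
from io import BytesIO

# a stream holding exactly one big-endian 4-byte integer
stream = BytesIO(struct.pack("!i", 42))
print(read_int(stream))   # -> 42: four bytes were available

# a second call finds the stream exhausted: stream.read(4) returns b'',
# which is falsy, so read_int raises EOFError -- the same condition the
# Spark worker hits when its input stream ends before a full int arrives
try:
    read_int(stream)
except EOFError:
    print("EOFError: no bytes left in the stream")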