I am trying to serialize a Spark RDD by pickling it, and then read the pickled file directly into Python:
a = sc.parallelize(['1','2','3','4','5'])
a.saveAsPickleFile('test_pkl')
Then I copied the test_pkl files to my local machine. How can I read them directly into Python? When I try the normal pickle package, it fails on the first pickle part of 'test_pkl':
pickle.load(open('part-00000','rb'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/pickle.py", line 1370, in load
return Unpickler(file).load()
File "/usr/lib64/python2.6/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib64/python2.6/pickle.py", line 970, in load_string
raise ValueError, "insecure string pickle"
ValueError: insecure string pickle
I assume that Spark uses a different pickling method than Python's pickle module (correct me if I'm wrong). Is there any way for me to pickle data from Spark and read this pickled object into Python directly from the file?
The problem is that it is not a pickle file but a [SequenceFile](https://wiki.apache.org/hadoop/SequenceFile) containing pickled objects, and I am not aware of any actively developed SequenceFile parser for Python. – zero323
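Given that `saveAsPickleFile` writes a Hadoop SequenceFile rather than a plain pickle, one possible workaround (a sketch, and only viable when the data fits in driver memory) is to `collect()` the RDD on the driver and write the resulting Python list with the standard `pickle` module; the file name `test_pkl_plain` and the stand-in list below are illustrative, not from the original question:

```python
import pickle

# In PySpark this would be: data = a.collect()
# Here a stand-in list simulates the collected RDD contents.
data = ['1', '2', '3', '4', '5']

# Write a plain pickle file, unlike saveAsPickleFile,
# which wraps pickled objects in a Hadoop SequenceFile.
with open('test_pkl_plain', 'wb') as f:
    pickle.dump(data, f)

# The file is now an ordinary pickle and loads directly:
with open('test_pkl_plain', 'rb') as f:
    restored = pickle.load(f)

print(restored)  # ['1', '2', '3', '4', '5']
```

If you need to stay within Spark, the files written by `saveAsPickleFile` can be read back with `sc.pickleFile('test_pkl')` inside a PySpark session; the standalone-Python route above avoids the SequenceFile format entirely.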