
I am using PySpark to write image sequence files, where the key is the image filename and the value is the image as a byte string:

from StringIO import StringIO   # Python 2
from skimage import io

def get_image(filename):
    s = StringIO()
    im = io.imread(filename)
    io.imsave(s, im)
    return [(filename, s)]  # the value here is a StringIO object, not a string

rdd = sc.parallelize(filenames) 
rdd.flatMap(get_image).saveAsSequenceFile("/user/myname/output") 

to represent the images. However, PySpark throws an exception indicating that the pickled data is in a format it does not support:

Caused by: net.razorvine.pickle.InvalidOpcodeException: opcode not implemented: OBJ 
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:224) 
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) 
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) 
    at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) 
    at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150) 
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) 
    at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
    at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298) 
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298) 
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) 
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:88) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    ... 1 more 

Is this on Spark 2.0? I ran into the same problem, but I did not see it on Spark 1.6. – jnesselr

Answer


The OBJ opcode is used during pickling when a Python class instance is encoded/serialized or decoded. In my case I never intended to write an object to the sequence file, so the fix was simply to correct that bug.
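A minimal, Spark-free sketch of why the opcode shows up (an illustration using CPython's own pickle and pickletools, not PySpark internals): pickling a class instance emits object-construction opcodes, while a plain byte string pickles with basic opcodes that Pyrolite 4.13 already understands. The `ImageHolder` class below is hypothetical, standing in for the `StringIO` object in the question.

```python
import pickle
import pickletools

class ImageHolder:
    """Hypothetical wrapper standing in for the StringIO in the question."""
    def __init__(self, data):
        self.data = data

def opcodes(value):
    """Names of the pickle opcodes used to serialize value."""
    return {op.name for op, _, _ in pickletools.genops(pickle.dumps(value))}

# Opcodes that reconstruct arbitrary objects (OBJ is the pre-protocol-2
# variant named in the traceback).
CONSTRUCT = {"OBJ", "REDUCE", "NEWOBJ", "NEWOBJ_EX", "BUILD"}

holder = ImageHolder(b"fake image bytes")
print(CONSTRUCT & opcodes(holder))       # instance needs construction opcodes
print(CONSTRUCT & opcodes(holder.data))  # raw bytes need none: empty set
```

The exact opcode names vary with the pickle protocol, but the contrast is the point: the deserializer on the JVM side only has to handle primitive opcodes when the values are plain byte strings.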

As for the wider ecosystem, the problem is that Spark uses Pyrolite 4.13, but OBJ encoding/decoding was not introduced into the Pyrolite library until version 4.17. As for what to do about it, I think you have a few options:

  1. Convince the Spark maintainers, via a pull request or GitHub issue, to move to a later version of Pyrolite.
  2. Build your own Spark distribution against that version of Pyrolite.
  3. Don't write class instances/objects to sequence files.
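For option 3, one sketch of the idea (not the asker's exact pipeline: it skips the re-encoding step and simply ships each file's existing bytes) is to have `get_image` return plain byte strings instead of a file-like object:

```python
def get_image(filename):
    # Read the already-encoded image file and return its raw bytes;
    # a plain (filename, bytes) pair pickles with basic opcodes that
    # Pyrolite 4.13 can decode, unlike a StringIO/BytesIO instance.
    with open(filename, "rb") as f:
        return [(filename, f.read())]

# The Spark side of the pipeline is unchanged (requires a SparkContext):
# rdd = sc.parallelize(filenames)
# rdd.flatMap(get_image).saveAsSequenceFile("/user/myname/output")
```

If you do need to re-encode the images first, the same rule applies: encode into an in-memory buffer and emit `buffer.getvalue()`, never the buffer object itself.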