Google Cloud Storage Connector for Spark

2014-10-02

I am trying to set up the Google Cloud Storage connector for Spark on Mac OS in order to test my Spark application locally. I have read the documentation (https://cloud.google.com/hadoop/google-cloud-storage-connector). I added "gcs-connector-latest-hadoop2.jar" to the spark/lib folder, and I also added the core-data.xml file to the spark/conf directory.
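
For reference, the connector documentation describes the Hadoop properties that go into that configuration file (conventionally named core-site.xml). A minimal sketch might look like the following; the project ID is a placeholder, and the exact authentication properties depend on the connector version:

    <configuration>
      <!-- tell Hadoop/Spark which class handles gs:// URIs -->
      <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      </property>
      <!-- placeholder: the Google Cloud project that owns the bucket -->
      <property>
        <name>fs.gs.project.id</name>
        <value>your-project-id</value>
      </property>
    </configuration>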

When I run my pyspark shell, I get an error:

>>> sc.textFile("gs://mybucket/test.csv").count() 
    Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 847, in count 
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 
    File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 838, in sum 
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) 
    File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 759, in reduce 
    vals = self.mapPartitions(func).collect() 
    File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 723, in collect 
    bytesInJava = self._jrdd.collect().iterator() 
    File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ 
    File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o26.collect. 
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found 
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895) 
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2379) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) 
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) 
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256) 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) 
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) 
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) 
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135) 
    at org.apache.spark.rdd.RDD.collect(RDD.scala:774) 
    at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:305) 
    at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:606) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:207) 
    at java.lang.Thread.run(Thread.java:744) 
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found 
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801) 
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893) 
    ... 40 more 

I am not sure where to go from here.

Answers

It may differ between Spark versions, but if you peek inside bdutil-0.35.2/extensions/spark/install_spark.sh, you will see how our "Spark + Hadoop on GCE" setup works when using bdutil; it includes the items you mention, placing the connector jar into the spark/lib folder and the core-site.xml file into the spark/conf directory, but it additionally adds the following line to spark/conf/spark-env.sh:

export SPARK_CLASSPATH=\$SPARK_CLASSPATH:${LOCAL_GCS_JAR} 

where ${LOCAL_GCS_JAR} is the absolute path of the jar file you added to spark/lib. Try adding that line to your spark/conf/spark-env.sh, and the ClassNotFoundException should go away.
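
As a concrete sketch, the entry could look like the lines below; the jar path is only an example based on the Spark location shown in the traceback, and should point at wherever the connector jar actually lives:

    # spark/conf/spark-env.sh (sketch; adjust the path to your local jar)
    LOCAL_GCS_JAR=/Users/poiuytrez/Documents/DataBerries/programs/spark/lib/gcs-connector-latest-hadoop2.jar
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:$LOCAL_GCS_JAR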

I get: "This is deprecated in Spark 1.0+. Please instead use: ./spark-submit with --driver-class-path to augment the driver classpath; spark.executor.extraClassPath to augment the executor classpath." But I get another error when I try to access my storage. I will create a new SO question. – poiuytrez 2014-10-03 08:07:31
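
For reference, a sketch of the Spark 1.0+ style that warning recommends; the jar and application paths are placeholders, and spark.executor.extraClassPath can be set in conf/spark-defaults.conf:

    # launch the app with the connector on the driver classpath (placeholder paths)
    ./bin/spark-submit --driver-class-path /path/to/gcs-connector-latest-hadoop2.jar your_app.py

    # conf/spark-defaults.conf: make the jar visible to the executors as well
    spark.executor.extraClassPath /path/to/gcs-connector-latest-hadoop2.jar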

I had a metadata server error, which I solved using your answer to this question: http://stackoverflow.com/questions/25291397/migrating-50tb-data-from-local-hadoop-cluster-to-google-cloud-storage – poiuytrez 2014-10-03 08:50:03

Adding $HADOOP_CLASSPATH to $SPARK_CLASSPATH in spark-env.sh solves the problem. (At least it worked for me with Spark 1.2.1.) – 2015-03-06 20:57:35
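
A sketch of that variant, assuming the gcs-connector jar is already on HADOOP_CLASSPATH (for example through a Hadoop installation configured per the connector documentation):

    # spark/conf/spark-env.sh -- reuse whatever Hadoop already puts on its classpath
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_CLASSPATH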
