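Unable to run Spark job using globStatus and a Google Cloud Storage bucket as input
I am using Spark 1.1. I have a Spark job that looks only for folders matching a certain pattern within a bucket (i.e. folders whose names start with ...), and it should process only those folders. I achieve this by doing the following: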
FileSystem fs = FileSystem.get(new Configuration(true));
FileStatus[] statusArr = fs.globStatus(new Path(inputPath));
List<FileStatus> statusList = Arrays.asList(statusArr);
List<String> pathsStr = convertFileStatusToPath(statusList);
JavaRDD<String> paths = sc.parallelize(pathsStr);
However, when I run this job on a Google Cloud Storage path, gs://rsync-1/2014_07_31* (using the latest Google Cloud Storage connector, 1.2.9), I get the following error:
14/10/13 10:28:38 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/13 10:28:38 INFO util.Utils: Successfully started service 'Driver' on port 60379.
14/10/13 10:28:38 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://[email protected]:45212/user/Worker
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.IllegalArgumentException: Wrong bucket: rsync-1, in path: gs://rsync-1/2014_07_31*, expected bucket: hadoop-config
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:100)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:294)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:457)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:163)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1052)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1027)
at com.doit.customer.dataconverter.Phase0.main(Phase0.java:578)
... 6 more
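When I run this job on a local folder, everything works fine.
hadoop-config is a bucket I use for deploying the Spark cluster on Google Compute Engine (using the bdutil 0.35.2 tool).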
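One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: the filesystem object returned by FileSystem.get(new Configuration(true)) appears to be bound to the cluster's default bucket (hadoop-config), so it rejects a path in another bucket (rsync-1). The snippet below uses only java.net.URI to illustrate that kind of bucket check. The class BucketCheck and its methods are hypothetical names made up for this illustration; this is not the connector's actual code. In Hadoop itself, Path.getFileSystem(Configuration) and FileSystem.get(URI, Configuration) resolve a filesystem from the path's own URI, which may be worth trying in place of the default-filesystem lookup.

```java
import java.net.URI;

// Hypothetical illustration only; this is NOT the GCS connector's code.
class BucketCheck {
    // Returns the bucket of a gs:// path, i.e. the authority part of its URI.
    static String bucketOf(String path) {
        return URI.create(path).getAuthority();
    }

    // Mimics the check implied by the error message: a filesystem bound to
    // one bucket rejects any path whose bucket differs.
    static void checkPath(String fsBucket, String path) {
        String bucket = bucketOf(path);
        if (!fsBucket.equals(bucket)) {
            throw new IllegalArgumentException("Wrong bucket: " + bucket
                + ", in path: " + path + ", expected bucket: " + fsBucket);
        }
    }

    public static void main(String[] args) {
        String inputPath = "gs://rsync-1/2014_07_31*";

        // Filesystem resolved from the input path's own bucket: check passes.
        checkPath(bucketOf(inputPath), inputPath);

        // Filesystem bound to the default bucket: check fails.
        try {
            checkPath("hadoop-config", inputPath);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this reading is right, the local run would succeed simply because the local path and the default filesystem there happen to agree.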