2017-01-13
1

I am facing a java.io.IOException: s3n://bucket-name : 400 : Bad Request error while loading Redshift data through the <a href="https://github.com/databricks/spark-redshift" rel="nofollow noreferrer">spark-redshift library</a>.

Both the Redshift cluster and the S3 bucket are in the Mumbai (ap-south-1) region.

Here is the full error stack trace:

2017-01-13 13:14:22 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, master): java.io.IOException: s3n://bucket-name : 400 : Bad Request 
      at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453) 
      at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427) 
      at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411) 
      at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181) 
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
      at java.lang.reflect.Method.invoke(Method.java:498) 
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) 
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) 
      at org.apache.hadoop.fs.s3native.$Proxy10.retrieveMetadata(Unknown Source) 
      at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476) 
      at com.databricks.spark.redshift.RedshiftRecordReader.initialize(RedshiftInputFormat.scala:115) 
      at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:92) 
      at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:80) 
      at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:279) 
      at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:263) 
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) 
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) 
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
      at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) 
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) 
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) 
      at org.apache.spark.scheduler.Task.run(Task.scala:86) 
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
      at java.lang.Thread.run(Thread.java:745) 
    Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request 
      at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:425) 
      at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:279) 
      at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1052) 
      at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2264) 
      at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2193) 
      at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120) 
      at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575) 
      at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174) 
      ... 30 more 

Here is my Java code:

SparkContext sparkContext = SparkSession.builder().appName("CreditModeling").getOrCreate().sparkContext(); 
sparkContext.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem"); 
sparkContext.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", fs_s3a_awsAccessKeyId); 
sparkContext.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", fs_s3a_awsSecretAccessKey); 
sparkContext.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com"); 

SQLContext sqlContext=new SQLContext(sparkContext); 
Dataset dataset= sqlContext 
     .read() 
     .format("com.databricks.spark.redshift") 
     .option("url", redshiftUrl) 
     .option("query", query) 
     .option("aws_iam_role", aws_iam_role) 
     .option("tempdir", "s3a://bucket-name/temp-dir") 
     .load(); 

I was able to work around the problem in Spark local mode by making the following changes (referring to this):

1) I replaced the jets3t jar with version 0.9.4

2) I changed the JetS3t configuration properties to support AWS Signature Version 4 for the bucket, as follows:

Jets3tProperties myProperties = Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME); 
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com"); 
myProperties.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256"); 
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646"); 

But now I am trying to run the job in cluster mode (Spark standalone mode with Mesos as the resource manager), and the error appears again :(

Any help would be appreciated!

Answer

1

The actual problem:

Updating the Jets3tProperties at runtime to support AWS S3 Signature Version 4 worked in local mode but not in cluster mode, because the properties were only being updated on the driver JVM and never on any of the executor JVMs.

Solution:

I found a workaround to update the Jets3tProperties on all the executors by referring to this link.

By referring to the link above, I added an extra code snippet that updates the Jets3tProperties inside a .foreachPartition() call, so that it runs on each executor.

Here is the code:

Dataset dataset= sqlContext 
      .read() 
      .format("com.databricks.spark.redshift") 
      .option("url", redshiftUrl) 
      .option("query", query) 
      .option("aws_iam_role", aws_iam_role) 
      .option("tempdir", "s3a://bucket-name/temp-dir") 
      .load(); 

dataset.foreachPartition(rows -> { 
    // Runs in each executor's JVM. The original "first" flag was a local 
    // variable re-initialized on every call, so it never actually limited 
    // execution; setting the same properties again is idempotent anyway. 
    Jets3tProperties myProperties = 
      Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME); 
    myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com"); 
    myProperties 
      .setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256"); 
    myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646"); 
}); 
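An alternative sometimes used (a sketch only, not something confirmed in this thread) is to ship a jets3t.properties file to every executor at submit time and put it on the executor classpath, since JetS3t loads a file with that exact name from the classpath. The jar name, main class, and file contents below are hypothetical placeholders:

```shell
# Hypothetical jets3t.properties contents:
#   s3service.s3-endpoint=s3.ap-south-1.amazonaws.com
#   storage-service.request-signature-version=AWS4-HMAC-SHA256

# --files copies the file into each executor's working directory;
# extraClassPath adds that directory so JetS3t can find the file.
spark-submit \
  --files jets3t.properties \
  --conf spark.executor.extraClassPath=. \
  --class com.example.CreditModeling \
  credit-modeling.jar
```

Whether the working directory ends up on the executor classpath depends on the cluster manager, so this may need adjusting for a Mesos deployment.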
+0

Is there any better solution for this? I am stuck here as well. –

+0

@SudevAmbadi To answer your question: no, there is no direct solution anywhere; this workaround is what I had to settle for. It really needs to be handled inside the Jets3t library. –

0

That stack trace means you are using the older, jets3t-based s3n connector, but you are setting credentials that only apply to s3a, the newer one. Use URLs like s3a:// to pick up the new connector.

Given that you are trying to use the V4 API, you will also need to set fs.s3a.endpoint. The 400/Bad Request response is what you will see if you try to authenticate with V4 against the central endpoint.
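A minimal sketch of the configuration this comment suggests (assuming Hadoop 2.6+ with hadoop-aws on the classpath; note that s3a reads fs.s3a.access.key / fs.s3a.secret.key, not the fs.s3a.awsAccessKeyId-style names used in the question):

```java
SparkContext sc = SparkSession.builder()
    .appName("CreditModeling").getOrCreate().sparkContext();

// s3a-specific credential property names, unlike the s3n-style names in the question
sc.hadoopConfiguration().set("fs.s3a.access.key", fs_s3a_awsAccessKeyId);
sc.hadoopConfiguration().set("fs.s3a.secret.key", fs_s3a_awsSecretAccessKey);

// Region-specific endpoint so V4 signing is used against ap-south-1
sc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com");

// Do NOT set fs.s3a.impl to NativeS3FileSystem as the question does: that maps
// s3a:// URLs back onto the old jets3t-based s3n connector, which cannot do V4.
```

With this in place, s3a:// URLs (such as the tempdir) are handled by the newer connector end to end, without touching JetS3t at all.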

+0

Thanks for the reply @Steve Loughran :) I have replaced the variable names in the question with their actual values. You can now see that I have already set the tempdir with an s3a:// URL, as you mentioned. –

+0

Also, I have already set fs.s3a.endpoint to the actual value for the Mumbai region. You are right about the cause of the 400/Bad Request: "The 400/Bad Request response is what you will see if you try to authenticate with V4 against the central endpoint." –

+0

However, all the changes I have made here work perfectly in local mode but not in cluster mode. So my guess is that they only take effect on the driver JVM and not on the executor JVMs. Does that make sense? –
