I've seen examples of people writing EMR output to HDFS, but I haven't been able to find an example of how it's actually done. What's more, this documentation seems to say that the --output argument of an EMR streaming job must be an S3 bucket. How do I write the output of an EMR streaming job to HDFS?
When I actually try to run a script (in this case, using Python streaming and mrjob), it throws an "Invalid S3 URI" error.
Here's the command:
python my_script.py -r emr \
--emr-job-flow-id=j-JOBID --conf-path=./mrjob.conf --no-output \
--output hdfs:///my-output \
hdfs:///my-input-directory/my-files*.gz
And the traceback...
Traceback (most recent call last):
  File "pipes/sampler.py", line 28, in <module>
    SamplerJob.run()
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 206, in run_job
    with self.make_runner() as runner:
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 524, in make_runner
    return super(MRJob, self).make_runner()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 161, in make_runner
    return EMRJobRunner(**self.emr_job_runner_kwargs())
  File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 585, in __init__
    self._output_dir = self._check_and_fix_s3_dir(self._output_dir)
  File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 776, in _check_and_fix_s3_dir
    raise ValueError('Invalid S3 URI: %r' % s3_uri)
ValueError: Invalid S3 URI: 'hdfs:///input/sample'
How do I write the output of an EMR streaming job to HDFS? Is it even possible?
This is an old question, but it may still be relevant. Looking at the mrjob source, EMRJobRunner only accepts S3 buckets as output destinations. Since you're using a "long-lived" cluster, you may be able to work around this by using HadoopJobRunner ('-r hadoop'). I haven't been able to get a working solution myself, though... – 2016-03-03 14:09:12
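To illustrate the workaround suggested above: a minimal sketch, not a verified solution. It assumes mrjob and your job script are available on the cluster's master node (run it there, e.g. over SSH, rather than from your laptop); the script name and HDFS paths are carried over from the question. Unlike EMRJobRunner, the hadoop runner hands --output straight to Hadoop, so an hdfs:// URI is not rejected:

```shell
# Sketch: run the same mrjob script with the local hadoop runner
# instead of the EMR runner, so --output may point at HDFS.
# Assumes this is executed on the EMR master node where `hadoop` is on PATH.
python my_script.py -r hadoop \
    --output hdfs:///my-output \
    hdfs:///my-input-directory/my-files*.gz
```

Note that with `-r hadoop` the `--emr-job-flow-id` flag no longer applies, since the job is submitted to whatever Hadoop installation the machine you run it on is configured for.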