I have a Hadoop streaming job like the one below. Why doesn't this seemingly correct Hadoop streaming Python script work?
The bash script:
hadoop fs -rmr /tmp/someone/sentiment/
hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar \
-input /user/hive/warehouse/tweetjsonsentilife20 \
-output /tmp/someone/sentiment/ \
-mapper "mapper_senti.py" \
-reducer "reducer_senti5.py" \
-file mapper_senti.py \
-file reducer_senti5.py \
-cmdenv dir1=http://ip-10-0-0-77.us-west-2.compute.internal:8088/home/ubuntu/aclImdb/train/pos \
-cmdenv dir2=http://ip-10-0-0-77.us-west-2.compute.internal:8088/home/ubuntu/aclImdb/train/pos/ \
The mapper:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    keyval = line.strip().split("\t")
    key, val = (keyval[0], keyval[0])
    if key != "\N" and val != "\N":
        sys.stdout.write('%s\t%s\n' % (key, val))
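As a sanity check, the mapper's per-line logic can be exercised locally without Hadoop. This is a minimal sketch with made-up sample input; "\N" is the NULL marker Hive writes for null columns (escaped as "\\N" here so the literal also parses under Python 3):

```python
# Mirror of the mapper's per-line logic, testable without Hadoop.
def map_line(line):
    keyval = line.strip().split("\t")
    key, val = (keyval[0], keyval[0])  # as in the script: both fields from keyval[0]
    if key != "\\N" and val != "\\N":
        return "%s\t%s" % (key, val)
    return None  # line filtered out (NULL key or value)

# Hypothetical sample input: one normal line, one Hive NULL line.
sample = ["foo\tbar", "\\N\tbar"]
print([map_line(l) for l in sample])  # ['foo\tfoo', None]
```

Note that, as written, the emitted value is `keyval[0]` rather than `keyval[1]`, so the mapper echoes the key twice; whether that is intended is worth double-checking.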
The reducer:
#!/usr/bin/env python
import sys
import os
for line in sys.stdin:
    keyval = line.strip().split("\t")
    key, val = (keyval[0], keyval[1])
    limit = 1
    dir1 = os.environ['dir1']
    dir2 = os.environ['dir2']
    for file in os.listdir(dir1)[:limit]:
        for word in set(open(dir2 + file).read()):
            value = word
            print "%s\t%s" % (key, value)
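One detail worth noting about the reducer: `os.listdir` only operates on local filesystem paths, so handing it the `http://...` value passed in via `-cmdenv dir1` raises an `OSError`, which would terminate the reducer process with a non-zero exit code. A minimal local reproduction, using a placeholder URL rather than the cluster's actual address:

```python
import os

# os.listdir expects a local directory path; an HTTP URL is not one.
url = "http://example.com/aclImdb/train/pos"  # placeholder for the dir1 value

try:
    os.listdir(url)
    raised = False
except OSError:  # FileNotFoundError on Python 3
    raised = True

print(raised)  # True: the reducer subprocess would die the same way
```

This matches the symptom in the log below (`subprocess failed with code 1` in the reduce phase only), since the mapper never touches `dir1`/`dir2`.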
And the error:
rmr: DEPRECATED: Please use 'rm -r' instead.
14/06/18 22:10:57 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://ip-10-0-0-77.us-west-2.compute.internal:8020/tmp/natashac/sentiment' to trash at: hdfs://ip-10-0-0-77.us-west-2.compute.internal:8020/user/ubuntu/.Trash/Current
14/06/18 22:10:59 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper_senti.py, reducer_senti5.py] [/opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar] /tmp/streamjob8672448578858048676.jar tmpDir=null
14/06/18 22:11:00 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-77.us-west-2.compute.internal/10.0.0.77:8032
14/06/18 22:11:01 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-77.us-west-2.compute.internal/10.0.0.77:8032
14/06/18 22:11:01 INFO mapred.FileInputFormat: Total input paths to process : 1
14/06/18 22:11:01 INFO mapreduce.JobSubmitter: number of splits:2
14/06/18 22:11:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1402902347983_0058
14/06/18 22:11:02 INFO impl.YarnClientImpl: Submitted application application_1402902347983_0058
14/06/18 22:11:02 INFO mapreduce.Job: The url to track the job: http://ip-10-0-0-77.us-west-2.compute.internal:8088/proxy/application_1402902347983_0058/
14/06/18 22:11:02 INFO mapreduce.Job: Running job: job_1402902347983_0058
14/06/18 22:11:09 INFO mapreduce.Job: Job job_1402902347983_0058 running in uber mode : false
14/06/18 22:11:09 INFO mapreduce.Job: map 0% reduce 0%
14/06/18 22:11:16 INFO mapreduce.Job: map 100% reduce 0%
14/06/18 22:11:22 INFO mapreduce.Job: Task Id : attempt_1402902347983_0058_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
Everything looks right, but it doesn't work. The mapper seems to be fine, but the reducer isn't. Does anyone have any idea?
Thanks!
Could you post your jobtracker logs? –
Thanks for the reply! I'll try to find them. – user3634601