
I have a set of Hadoop streaming job files like the ones below. Why don't these seemingly correct Hadoop streaming Python scripts work?

The bash file:

hadoop fs -rmr /tmp/someone/sentiment/ 

hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar \ 
      -input /user/hive/warehouse/tweetjsonsentilife20 \ 
      -output /tmp/someone/sentiment/ \ 
      -mapper "mapper_senti.py" \ 
      -reducer "reducer_senti5.py" \ 
      -file mapper_senti.py \ 
      -file reducer_senti5.py \ 
      -cmdenv dir1=http://ip-10-0-0-77.us-west-2.compute.internal:8088/home/ubuntu/aclImdb/train/pos \ 
      -cmdenv dir2=http://ip-10-0-0-77.us-west-2.compute.internal:8088/home/ubuntu/aclImdb/train/pos/

The mapper:

#!/usr/bin/env python

import sys

# Pass-through mapper: emit the first tab-separated field as both key
# and value (note that keyval[0] is used for both), skipping Hive's
# NULL marker "\N".
for line in sys.stdin:
    keyval = line.strip().split("\t")
    key, val = keyval[0], keyval[0]
    if key != "\N" and val != "\N":
        sys.stdout.write('%s\t%s\n' % (key, val))

The reducer:

#!/usr/bin/env python

import sys
import os

# Directory paths arrive through -cmdenv; they are read once here and
# must exist on whichever node ends up running this reducer.
dir1 = os.environ['dir1']
dir2 = os.environ['dir2']
limit = 1

for line in sys.stdin:
    keyval = line.strip().split("\t")
    key, val = keyval[0], keyval[1]
    # For each input key, emit every distinct character of the first
    # file listed in dir1 (the file itself is opened via dir2).
    for file in os.listdir(dir1)[:limit]:
        for word in set(open(dir2 + file).read()):
            print "%s\t%s" % (key, word)

And the error:

rmr: DEPRECATED: Please use 'rm -r' instead. 
14/06/18 22:10:57 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes. 
Moved: 'hdfs://ip-10-0-0-77.us-west-2.compute.internal:8020/tmp/natashac/sentiment' to trash at: hdfs://ip-10-0-0-77.us-west-2.compute.internal:8020/user/ubuntu/.Trash/Current 
14/06/18 22:10:59 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead. 
packageJobJar: [mapper_senti.py, reducer_senti5.py] [/opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar] /tmp/streamjob8672448578858048676.jar tmpDir=null 
14/06/18 22:11:00 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-77.us-west-2.compute.internal/10.0.0.77:8032 
14/06/18 22:11:01 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-77.us-west-2.compute.internal/10.0.0.77:8032 
14/06/18 22:11:01 INFO mapred.FileInputFormat: Total input paths to process : 1 
14/06/18 22:11:01 INFO mapreduce.JobSubmitter: number of splits:2 
14/06/18 22:11:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1402902347983_0058 
14/06/18 22:11:02 INFO impl.YarnClientImpl: Submitted application application_1402902347983_0058 
14/06/18 22:11:02 INFO mapreduce.Job: The url to track the job: http://ip-10-0-0-77.us-west-2.compute.internal:8088/proxy/application_1402902347983_0058/ 
14/06/18 22:11:02 INFO mapreduce.Job: Running job: job_1402902347983_0058 
14/06/18 22:11:09 INFO mapreduce.Job: Job job_1402902347983_0058 running in uber mode : false 
14/06/18 22:11:09 INFO mapreduce.Job: map 0% reduce 0% 
14/06/18 22:11:16 INFO mapreduce.Job: map 100% reduce 0% 
14/06/18 22:11:22 INFO mapreduce.Job: Task Id : attempt_1402902347983_0058_r_000001_0, Status : FAILED 
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 
     at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320) 
     at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533) 
     at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134) 
     at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237) 
     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459) 
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) 
     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165) 
     at java.security.AccessController.doPrivileged(Native Method) 
     at javax.security.auth.Subject.doAs(Subject.java:415) 

Everything looks right, but it doesn't work. The mapper seems fine, but the reducer does not. Does anyone have an idea?

Thanks!


Can you post your jobtracker logs? –


Thank you for the reply! I'll try to find them. – user3634601

Answer


What do you expect dir1 and dir2 to be? The reducer runs on some node allocated by YARN, and that node probably can't find these directories.
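
One common fix is to ship the data with the job instead of pointing at a master-node path. Below is a minimal sketch of the reducer under that approach; the archive name aclImdb.tar.gz and the relative path are illustrative assumptions, the idea being that a generic option such as -archives aclImdb.tar.gz#aclImdb unpacks the word lists into every task's working directory, wherever YARN schedules the reduce:

#!/usr/bin/env python
# Sketch only: assumes the job was submitted with
#   -archives aclImdb.tar.gz#aclImdb   (hypothetical archive name)
# so the word lists are unpacked locally on every task node.

import os
import sys

data_dir = 'aclImdb/train/pos'  # relative path inside the unpacked archive
limit = 1

for line in sys.stdin:
    keyval = line.strip().split('\t')
    key, val = keyval[0], keyval[1]
    # Same logic as the original reducer, but reading from the local
    # working directory rather than a path that only exists on the master.
    for name in os.listdir(data_dir)[:limit]:
        for word in set(open(os.path.join(data_dir, name)).read()):
            print '%s\t%s' % (key, word)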


Thanks for the reply. My dir1 and dir2 are physically on the master node. We have a distributed system: one master QA node and 4 slave QA nodes, and dir1 and dir2 sit on the master node. We use AWS. We recently swapped the nodes out and retired the old ones. I vaguely remember the Hadoop streaming job working on the old nodes. Not sure. Thanks! – user3634601


YARN will allocate an arbitrary node (depending on load) to run the reduce - i.e. most likely not the node where dir1 and dir2 are defined –


Thanks Arnon. I'm just confused: if YARN can't find these, how do other people's -cmdenv in their streaming bash files and os.environ in Python work? Why did people create -cmdenv and os.environ? They would be useless, I feel, if that were the case. – user3634601
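
For what it's worth, -cmdenv only exports a plain string into each task's environment; it does not copy the directory that string names. A small sketch (not the original code) of how the reducer could surface that instead of dying with an opaque exit code:

#!/usr/bin/env python
import os
import sys

# -cmdenv dir1=... sets this string on every task node, but the path it
# names must already exist locally for os.listdir() to succeed.
dir1 = os.environ.get('dir1', '')

if not os.path.isdir(dir1):
    # Fail loudly: this message lands in the task's stderr log instead
    # of the bare "subprocess failed with code 1".
    sys.stderr.write('dir1=%r does not exist on this node\n' % dir1)
    sys.exit(1)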
