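Hadoop Streaming Never Finishes
I am trying to learn how to use hadoop streaming. I am trying to run a very simple mapper with no reducers. When I run the program, it finishes 100% of the map tasks, then does nothing for ten minutes, and then reports that it has finished 0% of all the map tasks. I think that means the node manager had to kill the job, but I am not sure. I have waited up to half an hour in the past, and it never finishes.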
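I am using hadoop 1.2.1. Its documentation says it ships with the hadoop streaming jar, but I could not find it, so I downloaded hadoop-streaming-1.2.1 from the central Maven repository. Here is the command line: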
$ hadoop jar /hadoop/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=0 -input /stock -output /company_index -mapper /home/msknapp/workspace/stock/stock.mr/scripts/firstLetterMapper.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
packageJobJar: [] [/opt/hadoop-1.2.1/hadoop-streaming-1.2.1.jar] /tmp/streamjob7222367580107633928.jar tmpDir=null
13/12/22 07:04:14 WARN snappy.LoadSnappy: Snappy native library is available
13/12/22 07:04:14 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/22 07:04:14 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/22 07:04:14 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/22 07:04:17 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-msknapp/mapred/local]
13/12/22 07:04:17 INFO streaming.StreamJob: Running job: job_201312201826_0009
13/12/22 07:04:17 INFO streaming.StreamJob: To kill this job, run:
13/12/22 07:04:17 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201312201826_0009
13/12/22 07:04:17 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201312201826_0009
13/12/22 07:04:18 INFO streaming.StreamJob: map 0% reduce 0%
13/12/22 07:04:44 INFO streaming.StreamJob: map 100% reduce 0%
13/12/22 07:14:44 INFO streaming.StreamJob: map 0% reduce 0%
13/12/22 07:15:09 INFO streaming.StreamJob: map 100% reduce 0%
The python script I am calling is very simple. I have python 2.6.6 installed. The script works when I test it:
#!/usr/bin/env
import sys
import string
#import os

def map(instream=sys.stdin,outstream=sys.stdout):
    line = instream.readline()
    output=map_line(line)
    outstream.write(output)

def map_line(line):
    parts=string.split(line,"\t")
    key=parts[0]
    newkey=key[0]
    newvalue=key
    output=newkey+"\t"+newvalue
    return output

map()
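The input file is short and simple. It has tab-separated lines like "GE\tGeneral Electric", and I am sure the separators are tabs.
By the way, I am on CentWare 1.6, running hadoop 1.2.1 in pseudo-distributed mode on a VMWare virtual machine.
Can someone please explain to me why this is not working, and what I can do to fix it?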
Can you try again without the option '-D mapred.reduce.tasks=0'? – zhutoulala
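A few things may be worth checking in the script itself. This is a sketch, not a confirmed diagnosis. As posted, the shebang line `#!/usr/bin/env` names no interpreter, the mapper calls `readline()` only once, so it handles a single input line, and the output has no trailing newline. The ten-minute gap in the log (07:04:44 to 07:14:44) is also consistent with the default `mapred.task.timeout` of 600000 ms in Hadoop 1.x, which would suggest the map attempt was killed as unresponsive and then retried. Below is a minimal variant of the mapper that reads all of stdin. The function name `run` and the blank-line handling are my own additions, not part of the original script.

```python
#!/usr/bin/env python
import sys


def map_line(line):
    # The key is the first tab-separated field; its first letter becomes the new key.
    key = line.rstrip("\n").split("\t")[0]
    if not key:
        # Assumption: skip blank lines rather than fail with an IndexError.
        return None
    return key[0] + "\t" + key + "\n"


def run(instream=sys.stdin, outstream=sys.stdout):
    # Consume every input line, not just the first one.
    for line in instream:
        output = map_line(line)
        if output is not None:
            outstream.write(output)


if __name__ == "__main__":
    run()
```

If you pipe a few lines of the input file into this script on its stdin, it should print one "first letter, tab, key" line for each non-blank input line. Also make sure the script file is executable by the user that runs the tasks.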