我想實現一個簡單的Hadoop地圖使用Cloudera的5.5.0 地圖&減少步驟應該使用Python 2.6.6實現減少例如爲什麼hadoop mapreduce與python失敗,但腳本正在命令行上工作?
問題:
- 如果腳本正在在unix命令行上執行,他們工作得非常好,併產生預期的輸出。
cat join2 * .txt | ./join3_mapper.py |排序| ./join3_reducer.py
- 但執行腳本作爲Hadoop的任務非常失敗:
Hadoop的罐子/usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/inputTV/join2_gen*.txt -output /用戶/ Cloudera的/ output_tv -mapper /home/cloudera/join3_mapper.py -reducer /home/cloudera/join3_reducer.py -numReduceTasks 1
16/01/06 12:32:32 INFO mapreduce.Job: Task Id : attempt_1452069211060_0026_r_000000_0, Status : FAILED Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538) at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134) at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
映射器的工作原理,如果hadoop的命令與-numReduceTasks 0, Hadoop的作業正在執行僅地圖步驟執行時,成功地結束,並且輸出目錄從地圖步驟包含結果的文件。
我想減法步驟一定有什麼問題呢?
- 在色彩上標準錯誤日誌不顯示任何有關:
日誌上傳時間:星期三年1月6 12時33分十秒-0800 2016 日誌長度:222 的log4j:警告沒有附加目的地可以發現記錄儀(org.apache.hadoop.ipc.Server)。 log4j:WARN請正確初始化log4j系統。 log4j:警告有關更多信息,請參見http://logging.apache.org/log4j/1.2/faq.html#noconfig。
代碼的腳本: 第一檔:join3_mapper.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
line = line.strip() #strip out carriage return
tuple2 = line.split(",") #split line, into key and value, returns a list
if len(tuple2) == 2:
key = tuple2[0]
value = tuple2[1]
if value == 'ABC':
print('%s\t%s' % (key, value))
elif value.isdigit():
print('%s\t%s' % (key, value))
的第二個選項:join3_reducer.py
#!/usr/bin/env python
import sys
last_key = None #initialize these variables
running_total = 0
abcFound =False;
this_key = None
# -----------------------------------
# Loop the file
# --------------------------------
for input_line in sys.stdin:
input_line = input_line.strip()
# --------------------------------
# Get Next Key value pair, splitting at tab
# --------------------------------
tuple2 = input_line.split("\t")
this_key = tuple2[0]
value = tuple2[1]
if value.isdigit():
value = int(value)
# ---------------------------------
# Key Check part
# if this current key is same
# as the last one Consolidate
# otherwise Emit
# ---------------------------------
if last_key == this_key:
if value == 'ABC': # filter for only ABC in TV shows
abcFound=True;
else:
if isinstance(value, (int,long)):
running_total += value
else:
if last_key: #if this key is different from last key, and the previous
# (ie last) key is not empy,
# then output
# the previous <key running-count>
if abcFound:
print('%s\t%s' % (last_key, running_total))
abcFound=False;
running_total = value #reset values
last_key = this_key
if last_key == this_key:
print('%s\t%s' % (last_key, running_total))
我曾嘗試聲明輸入文件到的各種不同的方式hadoop命令,沒有區別,沒有成功。
我在做什麼錯?提示,想法非常讚賞謝謝
您是否需要toolrunner才能夠從命令行運行jar文件? –
另外,Java程序不是jar文件嗎? –
我不是自己執行一個jar文件,我正在執行hadoop命令並告訴hadoop執行聲明的jar文件。庫路徑後面的其餘部分是與hadoop-streaming.jar相關的參數,並與執行的MapReduce操作相關。是的,jar文件是java程序 –