2012-05-27

I am trying to use Hadoop streaming with Python. I have written simple map and reduce scripts (see here), but the job fails with "# of failed Map Tasks exceeded allowed limit".

The map script is as follows:

#!/usr/bin/env python

import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>", re.MULTILINE | re.DOTALL | re.IGNORECASE)

for line in sys.stdin:
    url = line.strip()
    match = title_re.search(urllib.urlopen(url).read())
    if match:
        print url, "\t", match.group(1).strip()
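Note that any URL that fails to fetch raises an unhandled exception, which kills the entire map task. A more defensive sketch of the same idea (my assumption, not part of the original question; it uses Python 3's urllib.request and simply skips bad URLs, logging them to stderr) would look like this:

```python
#!/usr/bin/env python3
import re
import sys
import urllib.request

# Same title-extraction pattern as the original mapper.
title_re = re.compile(r"<title>(.*?)</title>", re.MULTILINE | re.DOTALL | re.IGNORECASE)

def extract_title(html):
    """Return the stripped <title> text, or None if no title is found."""
    match = title_re.search(html)
    return match.group(1).strip() if match else None

def main():
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception as err:
            # Report the failure on stderr instead of crashing the whole task.
            print("failed to fetch %s: %s" % (url, err), file=sys.stderr)
            continue
        title = extract_title(html)
        if title is not None:
            print("%s\t%s" % (url, title))

if __name__ == "__main__":
    main()
```

With this variant a handful of dead URLs no longer brings down the map task; the failures show up in the task's stderr log instead.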

The reduce script is as follows:

#!/usr/bin/env python 

from operator import itemgetter 
import sys 

for line in sys.stdin:
    line = line.strip()
    print line

After running these scripts with the Hadoop streaming jar, the map tasks finish and I can see they are 100% complete, but the reduce job gets stuck at 22%, and after a long time it fails with the error ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I cannot figure out the exact reason behind this.

My terminal window looks like this:

shekhar@ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0$ hadoop jar contrib/streaming/hadoop-streaming-1.0.0.jar -mapper /host/Shekhar/HadoopWorld/MultiFetch.py -reducer /host/Shekhar/HadoopWorld/reducer.py -input /host/Shekhar/HadoopWorld/urls/* -output /host/Shekhar/HadoopWorld/titles3
Warning: $HADOOP_HOME is deprecated. 

packageJobJar: [/tmp/hadoop-shekhar/hadoop-unjar2709939812732871143/] [] /tmp/streamjob1176812134999992997.jar tmpDir=null 
12/05/27 11:27:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library 
12/05/27 11:27:46 INFO mapred.FileInputFormat: Total input paths to process : 3 
12/05/27 11:27:46 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-shekhar/mapred/local] 
12/05/27 11:27:46 INFO streaming.StreamJob: Running job: job_201205271050_0006 
12/05/27 11:27:46 INFO streaming.StreamJob: To kill this job, run: 
12/05/27 11:27:46 INFO streaming.StreamJob: /host/Shekhar/Softwares/hadoop-1.0.0/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201205271050_0006 
12/05/27 11:27:46 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201205271050_0006 
12/05/27 11:27:47 INFO streaming.StreamJob: map 0% reduce 0% 
12/05/27 11:28:07 INFO streaming.StreamJob: map 67% reduce 0% 
12/05/27 11:28:37 INFO streaming.StreamJob: map 100% reduce 0% 
12/05/27 11:28:40 INFO streaming.StreamJob: map 100% reduce 11% 
12/05/27 11:28:49 INFO streaming.StreamJob: map 100% reduce 22% 
12/05/27 11:31:35 INFO streaming.StreamJob: map 67% reduce 22% 
12/05/27 11:31:44 INFO streaming.StreamJob: map 100% reduce 22% 
12/05/27 11:34:52 INFO streaming.StreamJob: map 67% reduce 22% 
12/05/27 11:35:01 INFO streaming.StreamJob: map 100% reduce 22% 
12/05/27 11:38:11 INFO streaming.StreamJob: map 67% reduce 22% 
12/05/27 11:38:20 INFO streaming.StreamJob: map 100% reduce 22% 
12/05/27 11:41:29 INFO streaming.StreamJob: map 67% reduce 22% 
12/05/27 11:41:35 INFO streaming.StreamJob: map 100% reduce 100% 
12/05/27 11:41:35 INFO streaming.StreamJob: To kill this job, run: 
12/05/27 11:41:35 INFO streaming.StreamJob: /host/Shekhar/Softwares/hadoop-1.0.0/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201205271050_0006 
12/05/27 11:41:35 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201205271050_0006 
12/05/27 11:41:35 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201205271050_0006_m_000001 
12/05/27 11:41:35 INFO streaming.StreamJob: killJob... 
Streaming Job Failed! 

Can anyone help me?

EDIT: The job tracker details are as follows:

Hadoop job_201205271050_0006 on localhost 

User: shekhar 
Job Name: streamjob1176812134999992997.jar 
Job File: file:/tmp/hadoop-shekhar/mapred/staging/shekhar/.staging/job_201205271050_0006/job.xml 
Submit Host: ubuntu 
Submit Host Address: 127.0.1.1 
Job-ACLs: All users are allowed 
Job Setup: Successful 
Status: Failed 
Failure Info:# of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201205271050_0006_m_000001 
Started at: Sun May 27 11:27:46 IST 2012 
Failed at: Sun May 27 11:41:35 IST 2012 
Failed in: 13mins, 48sec 
Job Cleanup: Successful 
Black-listed TaskTrackers: 1 
Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      3           0         0         2          1        4/0
reduce   100.00%      1           0         0         0          1        0/1

Go to the tracking URL http://localhost:50030/jobdetails.jsp?jobid=job_201205271050_0006 to find out the actual error –


@Raze2dust, I opened that URL, but it shows the same error... – Shekhar


Did you check the stdout/stderr logs of the individual failed tasks? –

Answers

3

This error is just the generic message that too many map tasks failed to complete:

# of failed Map Tasks exceeded allowed limit

You can use the EMR console to navigate to the logs of the individual map/reduce tasks. Then you should be able to see what the problem is.

In my case, I got this error when I made small mistakes, such as setting the path to the map script incorrectly.

Steps for viewing the logs of the tasks:

http://antipatterns.blogspot.nl/2013/03/amazon-emr-map-reduce-error-of-failed.html

2

I just had the same error come up. In my case it turned out to be a parsing error: there were "unexpected" newlines at the places where stdin split the lines. I would suggest checking your data file. Once I removed the column that contained those newlines, it worked fine.
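A quick way to check the data for this problem is to count the fields per record before submitting the job; records broken by an embedded newline show the wrong field count. A minimal sketch (the tab separator and the expected field count of 2 are assumptions about the data layout):

```python
def find_broken_records(lines, expected_fields, sep="\t"):
    """Yield (line_number, field_count) for records whose field count
    differs from expected_fields -- a symptom of embedded newlines."""
    for lineno, line in enumerate(lines, start=1):
        count = len(line.rstrip("\n").split(sep))
        if count != expected_fields:
            yield (lineno, count)

# Example: record 2 contained an embedded newline, so it was split into
# lines 2 and 3, each with too few fields.
data = ["a\t1\n", "b\n", "2\n", "c\t3\n"]
bad = list(find_broken_records(data, expected_fields=2))
```

Running this over the real input file (e.g. with `open(path)` as `lines`) points you straight at the offending records.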

0

First check your stderr. The information you posted is not enough to decide what kind of error it is; stderr is usually found at: {your hadoop temp dir here}/mapred/local/userlogs/{your job id}/{your attempt id}/stderr

Sean's answer covers most cases when you first use Hadoop, so I guess you may be getting an 'env: python\r: No such file or directory' error. If so, just convert your CRLF line endings to LF to solve this problem, i.e. run a script to replace \r\n with \n.
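The conversion can be done with a tool like dos2unix, or with a few lines of Python (the file name below is hypothetical):

```python
def crlf_to_lf(data):
    """Replace Windows CRLF line endings with Unix LF in raw bytes."""
    return data.replace(b"\r\n", b"\n")

# Usage on a script file (path is hypothetical):
# with open("mapper.py", "rb") as f:
#     fixed = crlf_to_lf(f.read())
# with open("mapper.py", "wb") as f:
#     f.write(fixed)
```

Working on bytes rather than text avoids Python's universal-newline translation silently hiding the very characters you are trying to remove.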

0

Add the following line at the beginning of your mapper and reducer:

#!/usr/bin/python 