2013-05-20 159 views

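Hadoop - map tasks continue after the reduce tasks complete

I am running Hadoop version 1.0.0 on a cluster of about 500 nodes. My job has about 3000 map tasks and 10 reduce tasks. After about 4 hours (as expected) the map tasks complete. Shortly afterwards each of the reduce tasks completes and the results are available in my output directory. However, the JobTracker then decides that some of the map tasks have failed and starts re-executing them. The number of running and pending reduce tasks stays at zero. Eventually, after about 8 hours, the last of these map tasks completes successfully and the job is marked as successfully completed.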

Any ideas?


Here are some excerpts from the JobTracker log file:

// map tasks all complete, eg: 
2013-05-20 10:50:59,742 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_m_000430_0' has completed task_201305131710_0007_m_000430 successfully.

//reduce tasks all complete: 

2013-05-20 13:38:34,040 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000009_0' has completed task_201305131710_0007_r_000009 successfully.
2013-05-20 13:38:34,142 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000004_0' has completed task_201305131710_0007_r_000004 successfully.
2013-05-20 13:38:34,204 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000008_0' has completed task_201305131710_0007_r_000008 successfully.
2013-05-20 13:38:34,745 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000002_0' has completed task_201305131710_0007_r_000002 successfully.
2013-05-20 13:38:35,521 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000003_0' has completed task_201305131710_0007_r_000003 successfully.
2013-05-20 13:38:36,196 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000007_0' has completed task_201305131710_0007_r_000007 successfully.
2013-05-20 13:38:36,276 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:1295 to host HN301-1657.labs.edu.au
2013-05-20 13:38:36,469 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000005_0' has completed task_201305131710_0007_r_000005 successfully.
2013-05-20 13:38:36,598 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000006_0' has completed task_201305131710_0007_r_000006 successfully.
2013-05-20 13:38:36,612 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000000_0' has completed task_201305131710_0007_r_000000 successfully.
2013-05-20 13:38:40,388 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201305131710_0007_r_000001_0' has completed task_201305131710_0007_r_000001 successfully.
2013-05-20 13:44:12,795 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker 'tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896' 

//As the reduce tasks are reporting success, the job tracker detects that one of the task trackers has died and so restarts it.
//Each of the map tasks previously completed successfully by that task tracker is then re-executed
2013-05-20 13:44:12,795 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_000430_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,795 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_000571_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_001612_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_001629_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_001892_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_002424_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_002437_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_002696_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003130_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003149_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,796 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003187_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003275_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003358_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003437_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003451_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003478_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0007_m_003506_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305131710_0010_m_000021_0: Lost task tracker: tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0010_m_000021_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_000430_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_000571_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_001612_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_001629_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_001892_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_002424_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_002437_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_002696_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003130_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003149_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003187_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003275_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003358_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003437_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003451_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003478_0' 
2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201305131710_0007_m_003506_0' 
2013-05-20 13:44:12,917 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201305131710_0010_m_000021_0' to tip task_201305131710_0010_m_000021, for tracker 'tracker_HN301-1654.labs.edu.au:127.0.0.1/127.0.0.1:1100' 
2013-05-20 13:44:13,760 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a failed task task_201305131710_0007_m_000430 
2013-05-20 13:44:13,761 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201305131710_0007_m_000430_1' to tip task_201305131710_0007_m_000430, for tracker 'tracker_ZC329-0001.labs.edu.au:127.0.0.1/127.0.0.1:1113' 
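
To see how many attempts were invalidated by each lost tracker, the excerpt above can be summarized with a small script. This is only a sketch in Python: the regular expression is derived from the "Lost task tracker" lines shown here, not from any documented log format, and the function name `lost_attempts_by_tracker` is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the "Error from ...: Lost task tracker: ..." lines
# in the excerpt above (an assumption, not an official log format).
LOST_RE = re.compile(
    r"Error from (attempt_\d+_\d+_[mr]_\d+_\d+): Lost task tracker: (\S+)"
)


def lost_attempts_by_tracker(log_lines):
    """Group attempt IDs reported as lost by the tracker that was lost."""
    lost = defaultdict(list)
    for line in log_lines:
        match = LOST_RE.search(line)
        if match:
            attempt, tracker = match.groups()
            lost[tracker].append(attempt)
    return dict(lost)


if __name__ == "__main__":
    sample = [
        "2013-05-20 13:44:12,795 INFO org.apache.hadoop.mapred.TaskInProgress: "
        "Error from attempt_201305131710_0007_m_000430_0: Lost task tracker: "
        "tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896",
        "2013-05-20 13:44:12,797 INFO org.apache.hadoop.mapred.JobTracker: "
        "Removing task 'attempt_201305131710_0007_m_000430_0'",
    ]
    for tracker, attempts in lost_attempts_by_tracker(sample).items():
        print(tracker, len(attempts), attempts)
```

Feeding it the full JobTracker log (one line per list element) would show whether all the re-executed map tasks trace back to the same tracker.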

Answer


You might want to check the configuration of each cluster node:

tracker_HN301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896 

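The JobTracker will be trying to contact the TaskTracker node via the loopback address. Check the contents of the /etc/hosts file on each node to make sure they are correct (and ideally that each node knows about every other node in the cluster, so that you avoid the cost of DNS lookups).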

I'm not saying this is the cause of your problem, but it certainly doesn't look right and is something you should track down.
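
The /etc/hosts check suggested above could be sketched roughly as follows. This is an illustrative Python snippet, not a Hadoop tool: the function name `loopback_hostnames`, the `LOCAL_NAMES` set, and the sample entries (including the 10.0.0.17 address) are all assumptions made for the example. It flags any name other than the usual localhost aliases that is mapped to a loopback address.

```python
import ipaddress

# Names that are normally expected on a loopback line (assumed list).
LOCAL_NAMES = {
    "localhost", "localhost.localdomain",
    "localhost4", "localhost4.localdomain4",
    "localhost6", "localhost6.localdomain6",
    "ip6-localhost", "ip6-loopback",
}


def loopback_hostnames(hosts_text):
    """Return (hostname, address) pairs where a real hostname maps to loopback."""
    suspicious = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        fields = line.split()
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        if not address.is_loopback:
            continue
        for name in fields[1:]:
            if name.lower() not in LOCAL_NAMES:
                suspicious.append((name, str(address)))
    return suspicious


# Hypothetical hosts file content, for illustration only.
SAMPLE = """\
127.0.0.1   localhost
127.0.0.1   HN301-1657.labs.edu.au HN301-1657   # hostname bound to loopback
10.0.0.17   ZC329-0001.labs.edu.au ZC329-0001
"""

if __name__ == "__main__":
    for name, address in loopback_hostnames(SAMPLE):
        print(f"{name} resolves to loopback address {address}")
```

On a real node you would pass in the text of /etc/hosts instead of `SAMPLE`. An empty result only means no hostname is pinned to loopback in that file; it does not rule out other name-resolution problems.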
