At the moment I am trying to determine, through the log4j logs, why my Hadoop job is crashing: determining the cause of a Hadoop error in code, since the standard logs have been inconclusive as to whether it is file splits, container memory, or block size.

Basically, what the job is trying to do is issue a command on the underlying machine and collect the output of that command. At the moment all of these steps happen in a single map job (later I will attempt to reduce over the sum of these individual outputs).

The behaviour I am running into is this: for up to a certain number of outputs produced into the BufferedReader (for the sake of conversation, 28), everything works fine and the job finishes almost instantly; but when I increase that number to 29, the map job hangs at 67% completion. It makes three attempts, always stalling at 67%, and eventually terminates itself for lack of progress.

From the NameNode where the job was issued, we can see the following output:

17/10/09 15:19:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this. 
17/10/09 15:19:29 INFO input.FileInputFormat: Total input files to process : 1 
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: number of splits:1 
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1507562353923_0001 
17/10/09 15:19:30 INFO impl.YarnClientImpl: Submitted application application_1507562353923_0001 
17/10/09 15:19:30 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1507562353923_0001/ 
17/10/09 15:19:30 INFO mapreduce.Job: Running job: job_1507562353923_0001 
17/10/09 15:19:36 INFO mapreduce.Job: Job job_1507562353923_0001 running in uber mode : false 
17/10/09 15:19:36 INFO mapreduce.Job: map 0% reduce 0% 
17/10/09 15:19:53 INFO mapreduce.Job: map 67% reduce 0% 
17/10/09 15:30:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_0, Status : FAILED 
AttemptID:attempt_1507562353923_0001_m_000000_0 Timed out after 600 secs 
Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 6230 as user ubuntu for container container_1507562353923_0001_01_000002, result=success 
Container killed by the ApplicationMaster. 
Container killed on request. Exit code is 143 
Container exited with a non-zero exit code 143 

17/10/09 15:30:06 INFO mapreduce.Job: map 0% reduce 0% 
17/10/09 15:30:26 INFO mapreduce.Job: map 67% reduce 0% 
17/10/09 15:40:36 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_1, Status : FAILED 
AttemptID:attempt_1507562353923_0001_m_000000_1 Timed out after 600 secs 
17/10/09 15:40:37 INFO mapreduce.Job: map 0% reduce 0% 
17/10/09 15:40:52 INFO mapreduce.Job: map 67% reduce 0% 
17/10/09 15:51:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_2, Status : FAILED 
AttemptID:attempt_1507562353923_0001_m_000000_2 Timed out after 600 secs 
17/10/09 15:51:06 INFO mapreduce.Job: map 0% reduce 0% 
17/10/09 15:51:24 INFO mapreduce.Job: map 67% reduce 0% 
17/10/09 16:01:37 INFO mapreduce.Job: map 100% reduce 100% 
17/10/09 16:01:37 INFO mapreduce.Job: Job job_1507562353923_0001 failed with state FAILED due to: Task failed task_1507562353923_0001_m_000000 
Job failed as tasks failed. failedMaps:1 failedReduces:0 

17/10/09 16:01:37 INFO mapreduce.Job: Counters: 13 
    Job Counters 
     Failed map tasks=4 
     Killed reduce tasks=1 
     Launched map tasks=4 
     Other local map tasks=3 
     Data-local map tasks=1 
     Total time spent by all maps in occupied slots (ms)=5025782 
     Total time spent by all reduces in occupied slots (ms)=0 
     Total time spent by all map tasks (ms)=2512891 
     Total time spent by all reduce tasks (ms)=0 
     Total vcore-milliseconds taken by all map tasks=2512891 
     Total vcore-milliseconds taken by all reduce tasks=0 
     Total megabyte-milliseconds taken by all map tasks=10292801536 
     Total megabyte-milliseconds taken by all reduce tasks=0 

To determine the root cause of this problem I have spent a lot of time carefully poring over the log4j output logs, but they don't show anything conclusive. For instance, the logs for successful and unsuccessful jobs are almost identical, with the minor caveat that unsuccessful jobs terminate abruptly, like so:

(screenshot of the abruptly-terminating log4j output omitted)

The most interesting part seems to be here (taken from one of the earlier attempts; Hadoop always makes 3 attempts at completing the job):

2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 214 
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 450 
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 217 
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opStop, Time: 165 
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output 
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output 
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 1977; bufvoid = 104857600 
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600 
(&container_1507562353923_0001_01_000005???stderr0stdout105912017-10-09 16:01:35 
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode): 

The words "Full thread dump" lead me to believe that my MapOutputBuffer is filling up, as described in this article, so I tried the suggested solution of increasing the relevant sizes in my configuration file, i.e. raising mapreduce.task.io.sort.mb to 1000 and mapreduce.map.sort.spill.percent to 0.99, but it had no positive effect.
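For reference, here is a minimal sketch of how those two properties can be set programmatically in the driver instead of in mapred-site.xml (the driver class OpcodeCountDriver is hypothetical; the real driver is not reproduced in this question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class OpcodeCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // equivalent to the mapred-site.xml changes described above
        conf.setInt("mapreduce.task.io.sort.mb", 1000);           // in-memory sort buffer size, in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.99f); // fraction of the buffer filled before a spill starts
        Job job = Job.getInstance(conf, "opcode count");
        // ... remaining job setup (mapper class, input/output paths, etc.) unchanged ...
    }
}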

The next thing I tried was to change the program so that it emits its debug/sanity-check output as a log4j Log, rather than writing it to a file as described here, but that did not work either.


At the moment, my ideas about what might be wrong with this program are as follows:

1) The MapOutputBuffer is filling up, causing the program to stall

Something I should mention here: it is important that a file be executed as an atomic map task. If a command in the file is broken up into smaller components, the operations it issues to the underlying OS will be invalid. Hence:

2) The input file is being split up inside HDFS and is not being read in as one sequential unit, in the correct order

3) The command is being split across containers, with one part of it issued in one container while another container takes care of the other part, causing exactly the problem I just described (one way to test for this is sketched below).
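One way to rule possibilities 2 and 3 in or out would be to force each input file to be consumed, in order, by exactly one mapper. A minimal sketch, assuming the job currently uses TextInputFormat (the class name WholeFileTextInputFormat is made up for illustration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // never split the file, so a single map task reads it sequentially from start to finish
        return false;
    }
}

Registering it in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class) would then guarantee one unsplit map task per file; if the job still hangs at 67%, that would point away from possibilities 2 and 3.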


So, with the logs inconclusive, my next objective is to change my code in some way that lets me isolate, test, and eliminate each of these possibilities in turn. Any ideas on how I can do that effectively, or any and all comments, insights, and remarks about anything else I have written here, would be greatly appreciated.

Here is how the map task is written at the moment:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // declare logger (output goes to the task's log4j syslog)
    private final Logger LOG = org.apache.log4j.Logger.getLogger(this.getClass());

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        // output data structure: (opcode name, execution step time) pairs
        List<Map.Entry<String, Integer>> pairList = new ArrayList<>();

        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            try {
                // command execution: run the 'evm' binary against the input line
                String evmDir = "/home/ubuntu/go/src/github.com/ethereum/go-ethereum/build/bin/evm";
                String command = evmDir + " --debug --code " + value.toString() + " run";
                Process proc = Runtime.getRuntime().exec(command);

                LOG.warn(command);

                BufferedReader stdInput = new BufferedReader(new InputStreamReader(proc.getInputStream()));
                BufferedReader stdError = new BufferedReader(new InputStreamReader(proc.getErrorStream()));

                // define and initialise representations to hold the 'evm' command output
                ArrayList<String> consoleOutput = new ArrayList<String>();
                ArrayList<String> debugOutput = new ArrayList<String>();

                // drain stdout first, then stderr
                String s = null;
                while ((s = stdInput.readLine()) != null) {
                    consoleOutput.add(s);
                }
                while ((s = stdError.readLine()) != null) {
                    debugOutput.add(s);
                }

                // parse console lines of the form "<opcode><whitespace><time>"
                for (String p : consoleOutput) {
                    Pattern pattern = Pattern.compile("([A-Za-z]+)([ \t]+)(\\d+)");
                    Matcher matcher = pattern.matcher(p);
                    while (matcher.find()) {
                        String opcodeName = matcher.group(1);
                        Integer executionStepTime = Integer.valueOf(matcher.group(3));
                        // add component pieces of the line to the output data structure
                        pairList.add(new AbstractMap.SimpleEntry<>(opcodeName, executionStepTime));
                    }
                }

            } catch (IOException e) {
                LOG.warn("Exception Encountered! " + e);
            }

            // log output for debugging
            for (Map.Entry<String, Integer> entry : pairList) {
                String opcodeRep = entry.getKey();
                Integer stepTime = entry.getValue();
                LOG.warn("Code: " + opcodeRep + ", Time: " + stepTime);
            }

            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Perhaps this additional information from the logs may also be pertinent to correctly diagnosing the problem:

"VM Thread" os_prio=0 tid=0x00007f0a1007a000 nid=0x7328 runnable 

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f0a10026000 nid=0x7326 runnable 

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f0a10027800 nid=0x7327 runnable 

"VM Periodic Task Thread" os_prio=0 tid=0x00007f0a100bc800 nid=0x732f waiting on condition 

JNI global references: 277 

Heap 
PSYoungGen  total 100352K, used 7502K [0x0000000780000000, 0x0000000788a00000, 0x00000007c0000000) 
    eden space 94720K, 2% used [0x0000000780000000,0x0000000780216d78,0x0000000785c80000) 
    from space 5632K, 95% used [0x0000000788480000,0x00000007889bcbc8,0x0000000788a00000) 
    to space 8192K, 0% used [0x0000000787a00000,0x0000000787a00000,0x0000000788200000) 
ParOldGen  total 143360K, used 113628K [0x0000000700000000, 0x0000000708c00000, 0x0000000780000000) 
    object space 143360K, 79% used [0x0000000700000000,0x0000000706ef71d0,0x0000000708c00000) 
Metaspace  used 25981K, capacity 26224K, committed 26496K, reserved 1073152K 
    class space used 3019K, capacity 3083K, committed 3200K, reserved 1048576K 

Answer


To understand exactly where the mapper is stuck, you can use jstack [to take thread dumps].

jstack ships with the JDK, and you can use it against the stuck mapper as follows.

Step 0: Find the hostname on which your map task is running and note down the TASK_ID.

Step 1: Log in to that node and run

ps aux | grep TASK_ID

Identify the process id and the username of the process that starts with /usr/java/jdk/bin/java.

Step 2: su to the username of the process owner.

Step 3: Export the java home and bin path [e.g.: export JAVA_HOME=/usr/java/jdk1.7.0_67 && export PATH=$JAVA_HOME/bin:$PATH].

Step 4: Replace pid with the PID you got in Step 1:

export PID=pid
for i in $(seq 1 10); do echo "Jstack Iteration $i"; jstack $PID > /tmp/hungtask-`hostname`-${PID}.jstack.$i; sleep 5s; done
tar zcvf hungtask.tar.gz /tmp/hungtask-`hostname`-${PID}.jstack.*

The hungtask.tar.gz will contain 10 thread dumps of the process, taken 5 seconds apart. You may need to run the script while the task is in the hung state.

After that, if you can upload hungtask.tar.gz to this thread, I can take a look and share my observations.

Also, to understand whether the process is undergoing frequent GC, you can try the command below:

jstat -gc -t PID STEP

PID is the process id of the java process to monitor, and STEP is the sampling time step.

You can paste the output at http://nix-on.blogspot.in/2015/01/java-jstat-how-to-visualize-garbage.html to see whether excessive GC is occurring.


So, I'm confused about how to do this... On the slave node that is executing this job I first tried to find the 'pid' with 'jps', and I got the following output: '6432 DataNode 7986 Jps 7826 YarnChild 7876 YarnChild 6573 NodeManager'; I then tried to execute your command against both of those 'pid's like this: 'export PID=7826 for i in $(seq 1 10); do echo "Jstack Iteration $i"; jstack $PID > /tmp/hungtask-hostname-${PID}.jstack.$i; sleep 5s; done tar zcvf hungtask.tar.gz /tmp/hungtask-hostname-${PID}.jstack.*', but I got the error '-bash: syntax error near unexpected token 'do'' –


man, what do you think about this one https://stackoverflow.com/questions/46721969/g%C3%B6del-escher-bach-the-epiphenomenological-hadoop-job-fail-even-in-uber-mod –