At the moment I am trying to determine the cause of my Hadoop job's crashes through the log4j logs; that is, to pin down where the Hadoop error originates in my code, since the standard logs have not settled whether the cause is the file splits, the container memory, or the block size.
Basically, what the job attempts to do is issue a command on the underlying machine and collect the output of that command - at the moment, all of these steps happen in a single map job (later on I will attempt a reduce over the sum of these individual outputs).
The behaviour I am experiencing is that for up to a certain number of outputs to the BufferedReader - 28, for the sake of argument - everything works fine and the job finishes almost instantly, but when I increase that number to 29, the map job hangs at 67% completion - it is attempted three times, always stalling at 67% - and eventually terminates itself for lack of progress.
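As an aside, the "Timed out after 600 secs" in the logs below matches Hadoop's default task timeout, mapreduce.task.timeout, which kills any task attempt that reports no progress for 600 000 ms. A minimal job-setup sketch, assuming the standard new-API Configuration/Job classes (the job name here is illustrative), that raises the timeout purely to distinguish a slow task from a truly deadlocked one:

// (imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job)
// Sketch only: raise the no-progress timeout from the default 600 000 ms
// to 30 minutes, to see whether a "stuck" attempt ever finishes.
Configuration conf = new Configuration();
conf.setLong("mapreduce.task.timeout", 1800000); // in ms; 0 disables the timeout entirely
Job job = Job.getInstance(conf, "opcode count");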
From the NameNode where the job is issued, we can see the following output:
17/10/09 15:19:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/10/09 15:19:29 INFO input.FileInputFormat: Total input files to process : 1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: number of splits:1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1507562353923_0001
17/10/09 15:19:30 INFO impl.YarnClientImpl: Submitted application application_1507562353923_0001
17/10/09 15:19:30 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1507562353923_0001/
17/10/09 15:19:30 INFO mapreduce.Job: Running job: job_1507562353923_0001
17/10/09 15:19:36 INFO mapreduce.Job: Job job_1507562353923_0001 running in uber mode : false
17/10/09 15:19:36 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:19:53 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:30:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_0, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_0 Timed out after 600 secs
Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 6230 as user ubuntu for container container_1507562353923_0001_01_000002, result=success
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/10/09 15:30:06 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:30:26 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:40:36 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_1, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_1 Timed out after 600 secs
17/10/09 15:40:37 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:40:52 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:51:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_2, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_2 Timed out after 600 secs
17/10/09 15:51:06 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:51:24 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 16:01:37 INFO mapreduce.Job: map 100% reduce 100%
17/10/09 16:01:37 INFO mapreduce.Job: Job job_1507562353923_0001 failed with state FAILED due to: Task failed task_1507562353923_0001_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
17/10/09 16:01:37 INFO mapreduce.Job: Counters: 13
Job Counters
Failed map tasks=4
Killed reduce tasks=1
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5025782
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2512891
Total time spent by all reduce tasks (ms)=0
Total vcore-milliseconds taken by all map tasks=2512891
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=10292801536
Total megabyte-milliseconds taken by all reduce tasks=0
To determine the root cause of this problem I have spent a lot of time poring over the log4j output logs, but they don't turn up anything conclusive - for instance, they look almost identical for successful and unsuccessful jobs, with the minor caveat that the unsuccessful jobs terminate abruptly, like so:
The most interesting part seems to be here (taken from one of the earlier attempts (Hadoop always makes 3 attempts at completing the job)):
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 214
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 450
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 217
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opStop, Time: 165
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 1977; bufvoid = 104857600
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
(&container_1507562353923_0001_01_000005???stderr0stdout105912017-10-09 16:01:35
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
The words "Full thread" lead me to believe that my MapOutputBuffer is filling up, as described in this article, so I tried the suggested solution of increasing the relevant sizes in my configuration file, i.e. raising mapreduce.task.io.sort.mb to 1000 and mapreduce.map.sort.spill.percent to 0.99, but this had no positive effect.
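For concreteness, here is the programmatic equivalent of those two settings - a sketch that assumes the standard Configuration/Job submission setup rather than editing mapred-site.xml, with an illustrative job name:

// (imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job)
// Sketch: the same two tuning knobs applied in code at submission time.
Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 1000);           // in-memory sort buffer, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.99f); // spill once the buffer is 99% full
Job job = Job.getInstance(conf, "opcode count");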
The next thing I tried was to change the program so that it emits its debug/sanity-check output through log4j as a Log, rather than writing it to a file as described here, but that didn't work either.
At the moment my ideas about what could be going wrong with this program are the following:
1) The MapOutputBuffer is filling up, causing the program to stall.
Something I should mention is that it is important that a file is executed as one atomic map task - if the commands in the file are broken up into smaller components, the operations they issue to the underlying OS become invalid - therefore:
2) The input file is being broken up inside HDFS and is not being read in as one sequential unit in the correct order.
3) The command is being split across containers, with one part of the command issued in one container while another container takes care of another part - thereby causing the problem I have just pointed out.
So, as the logs were inconclusive, my next goal is to change my code in some way that lets me isolate, test, and eliminate each of these possibilities (one candidate test is sketched just below) - any ideas on how I could do this efficiently, or any and all comments, insights, and remarks on anything else I have written here, would be greatly appreciated.
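One way of testing possibilities 2) and 3) in isolation might be to force each input file into a single, unsplit map task. This is a sketch only; NonSplittableTextInputFormat is an illustrative name of mine, not something from my current code:

// (imports: org.apache.hadoop.fs.Path, org.apache.hadoop.mapreduce.JobContext,
//  org.apache.hadoop.mapreduce.lib.input.TextInputFormat)
// Sketch: a TextInputFormat that refuses to split its files, so each file
// is consumed in order by exactly one mapper. If the job still hangs with
// this in place, input splitting (possibilities 2 and 3) is probably not
// the culprit.
public static class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
// wired in at job-setup time:
// job.setInputFormatClass(NonSplittableTextInputFormat.class);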
Here is how the map task is written at the moment:
// imports used by this mapper (at the top of the enclosing OpcodeCount file):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // declare logger
    private final Logger LOG = org.apache.log4j.Logger.getLogger(this.getClass());

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // output data struct
        List<Map.Entry<String, Integer>> pairList = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            try {
                // command execution: run the 'evm' binary against the input line
                String evmDir = "/home/ubuntu/go/src/github.com/ethereum/go-ethereum/build/bin/evm";
                String command = evmDir + " --debug --code " + value.toString() + " run";
                Process proc = Runtime.getRuntime().exec(command);
                LOG.warn(command);
                BufferedReader stdInput = new BufferedReader(new InputStreamReader(proc.getInputStream()));
                BufferedReader stdError = new BufferedReader(new InputStreamReader(proc.getErrorStream()));
                // define and initialise representation to hold 'evm' command output
                ArrayList<String> consoleOutput = new ArrayList<String>();
                ArrayList<String> debugOutput = new ArrayList<String>();
                String s = null;
                // note: stdout is drained to completion before stderr is read,
                // and proc is never waitFor()ed
                while ((s = stdInput.readLine()) != null) {
                    consoleOutput.add(s);
                }
                while ((s = stdError.readLine()) != null) {
                    debugOutput.add(s);
                }
                for (String p : consoleOutput) {
                    Pattern pattern = Pattern.compile("([A-Za-z]+)([ \t]+)(\\d+)");
                    Matcher matcher = pattern.matcher(p);
                    while (matcher.find()) {
                        String opcodeName = matcher.group(1);
                        Integer executionStepTime = Integer.valueOf(matcher.group(3));
                        // add component pieces of line to output data structure
                        pairList.add(new AbstractMap.SimpleEntry<>(opcodeName, executionStepTime));
                    }
                }
            } catch (IOException e) {
                //LOG.warn(e);
                LOG.warn("Exception Encountered! " + e);
            }
            // log output for debugging
            for (Map.Entry<String, Integer> entry : pairList) {
                String opcodeRep = entry.getKey().toString();
                Integer stepTime = entry.getValue();
                LOG.warn("Code: " + opcodeRep + ", Time: " + stepTime);
            }
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
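While writing this up, one more thing to rule out occurred to me, flagged in the comments above: stdout is drained to completion before stderr is ever read, and the process is never reaped with waitFor(). If the evm child fills its stderr pipe buffer while the mapper is still blocked reading stdout, both processes stall indefinitely - which would look exactly like a map task frozen at a fixed percentage until the 600-second timeout fires, and might explain why 28 outputs work while 29 do not. Below is a minimal sketch of draining both streams concurrently; it assumes the same command and context variables as above, and the executor plumbing is mine, not part of my current code:

// (imports: java.util.concurrent.ExecutorService, java.util.concurrent.Executors,
//  java.util.concurrent.Future)
// Sketch: drain stdout and stderr concurrently so that neither pipe
// buffer can fill up and block the evm child process.
ExecutorService pool = Executors.newSingleThreadExecutor();
Process proc = Runtime.getRuntime().exec(command);
// stderr is consumed on a background thread...
Future<List<String>> errFuture = pool.submit(() -> {
    List<String> lines = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getErrorStream()))) {
        for (String line; (line = r.readLine()) != null; ) {
            lines.add(line);
        }
    }
    return lines;
});
// ...while stdout is consumed on the mapper thread; context.progress()
// tells Hadoop the task is still alive during a long-running command.
List<String> consoleOutput = new ArrayList<>();
try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
    for (String line; (line = r.readLine()) != null; ) {
        consoleOutput.add(line);
        context.progress();
    }
}
List<String> debugOutput = errFuture.get(); // rethrows any reader failure (ExecutionException handling omitted in this sketch)
int exitCode = proc.waitFor();              // reap the child and get its exit status
pool.shutdown();

(A single extra thread is enough here, since only two streams need draining.) If anyone can confirm or refute this pipe-buffer explanation, that would be appreciated too.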
Perhaps this additional information from the logs may be pertinent to correctly diagnosing the problem:
"VM Thread" os_prio=0 tid=0x00007f0a1007a000 nid=0x7328 runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f0a10026000 nid=0x7326 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f0a10027800 nid=0x7327 runnable
"VM Periodic Task Thread" os_prio=0 tid=0x00007f0a100bc800 nid=0x732f waiting on condition
JNI global references: 277
Heap
PSYoungGen total 100352K, used 7502K [0x0000000780000000, 0x0000000788a00000, 0x00000007c0000000)
eden space 94720K, 2% used [0x0000000780000000,0x0000000780216d78,0x0000000785c80000)
from space 5632K, 95% used [0x0000000788480000,0x00000007889bcbc8,0x0000000788a00000)
to space 8192K, 0% used [0x0000000787a00000,0x0000000787a00000,0x0000000788200000)
ParOldGen total 143360K, used 113628K [0x0000000700000000, 0x0000000708c00000, 0x0000000780000000)
object space 143360K, 79% used [0x0000000700000000,0x0000000706ef71d0,0x0000000708c00000)
Metaspace used 25981K, capacity 26224K, committed 26496K, reserved 1073152K
class space used 3019K, capacity 3083K, committed 3200K, reserved 1048576K
So, I am confused as to how to do this... on the slave node that is executing this job I first tried to find the 'pid' using 'jps', whereupon I got the following output: '6432 DataNode 7986 Jps 7826 YarnChild 7876 YarnChild 6573 NodeManager'; then I tried executing your command against both of those 'pid's, in this manner: 'export PID=7826 for i in $(seq 1 10); do echo "Jstack Iteration $i"; jstack $PID > /tmp/hungtask-hostname-${PID}.jstack.$i; sleep 5s; done tar zcvf hungtask.tar.gz /tmp/hungtask-hostname-${PID}.jstack.*', but I get the error '-bash: syntax error near unexpected token 'do'' –
man, what do you think about this one https://stackoverflow.com/questions/46721969/g%C3%B6del-escher-bach-the-epiphenomenological-hadoop-job-fail-even-in-uber-mod –