Hadoop reduce never creates an output file for large data

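I am writing a Java application on Hadoop 1.1.1 (Ubuntu) that compares strings to find their longest common substring. The map and reduce phases both run successfully on small data sets. Whenever I increase the size of the input, though, my reduce output never appears in the target output directory. Hadoop does not complain at all, which makes this all the stranger. I am running everything inside Eclipse, with 1 mapper and 1 reducer.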

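My reducer finds the longest common substring in a collection of strings, then emits the substring as the key and the index of a string that contains it as the value. Here is a short example.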

Input data

0: ALPHAA 

1: ALPHAB 

2: ALZHA

Output emitted

Key: ALPHA Value: 0 

Key: ALPHA Value: 1 

Key: AL Value: 0 

Key: AL Value: 1 

Key: AL Value: 2

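The first two input strings share "ALPHA" as a common substring, while all three share "AL". When the process is finished, I will ultimately index the substrings and write them to a database.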
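For readers unfamiliar with the problem, here is a minimal sketch of a pairwise longest-common-substring routine using the standard dynamic-programming table. It is not the code from the question (which was not posted); the class name `LcsSketch` and the method name are made up for illustration, and it only handles two strings at a time rather than a whole collection.

```java
final class LcsSketch {
    // Returns the longest substring common to a and b.
    // On ties, the candidate that ends earliest in a wins.
    static String longestCommonSubstring(String a, String b) {
        // len[i][j] = length of the common suffix of a[0..i) and b[0..j)
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0;    // length of the best match so far
        int endInA = 0;  // exclusive end index of that match in a
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return a.substring(endInA - best, endInA);
    }

    public static void main(String[] args) {
        System.out.println(longestCommonSubstring("ALPHAA", "ALPHAB")); // ALPHA
        System.out.println(longestCommonSubstring("ALPHAA", "ALZHA"));  // AL
    }
}
```

The table costs O(|a| * |b|) time and memory per pair, so on a large input an unexpected string (very long, empty, or null) is a plausible place for a reducer to fail.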

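One more observation: I can see intermediate files being created in my output directory. It is only the reduced data that never makes it into an output file.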

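I have pasted the Hadoop output log below. It claims there are a number of output records from the reducer, but they seem to disappear. Any suggestions are appreciated.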

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool  for the same. 
No job jar file set. User classes may not be found. See JobConf(Class) or  JobConf#setJar(String). 
Total input paths to process : 1 
Running job: job_local_0001 
setsid exited with exit code 0 
Using ResourceCalculatorPlugin :  [email protected] 
Snappy native library not loaded 
io.sort.mb = 100 
data buffer = 79691776/99614720 
record buffer = 262144/327680 
map 0% reduce 0% 
Spilling map output: record full = true 
bufstart = 0; bufend = 22852573; bufvoid = 99614720 
kvstart = 0; kvend = 262144; length = 327680 
Finished spill 0 
Starting flush of map output 
Finished spill 1 
Merging 2 sorted segments 
Down to the last merge-pass, with 2 segments left of total size: 28981648 bytes 

Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 

Task attempt_local_0001_m_000000_0 done. 
Using ResourceCalculatorPlugin :  [email protected] 

Merging 1 sorted segments 
Down to the last merge-pass, with 1 segments left of total size: 28981646 bytes 

map 100% reduce 0% 
reduce > reduce 
map 100% reduce 66% 
reduce > reduce 
map 100% reduce 67% 
reduce > reduce 
reduce > reduce 
map 100% reduce 68% 
reduce > reduce 
reduce > reduce 
reduce > reduce 
map 100% reduce 69% 
reduce > reduce 
reduce > reduce 
map 100% reduce 70% 
reduce > reduce 
job_local_0001 
Job complete: job_local_0001 
Counters: 22 
    File Output Format Counters 
    Bytes Written=14754916 
    FileSystemCounters 
    FILE_BYTES_READ=61475617 
    HDFS_BYTES_READ=97361881 
    FILE_BYTES_WRITTEN=116018418 
    HDFS_BYTES_WRITTEN=116746326 
    File Input Format Counters 
    Bytes Read=46366176 
    Map-Reduce Framework 
    Reduce input groups=27774 
    Map output materialized bytes=28981650 
    Combine output records=0 
    Map input records=4629524 
    Reduce shuffle bytes=0 
    Physical memory (bytes) snapshot=0 
    Reduce output records=832559 
    Spilled Records=651304 
    Map output bytes=28289481 
    CPU time spent (ms)=0 
    Total committed heap usage (bytes)=2578972672 
    Virtual memory (bytes) snapshot=0 
    Combine input records=0 
    Map output records=325652 
    SPLIT_RAW_BYTES=136 
    Reduce input records=27774 
reduce > reduce 
reduce > reduce

Source

2013-05-14 mj_

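Just a thought: maybe there is a bug in your code that some of the strings in the large run expose? – greedybuddha

I think you are right. I just found an exception. –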

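I wrap my reduce() and map() logic in a try-catch block, and the catch block increments a counter whose group is "Exception" and whose name is the exception message. This gives me a quick way (by looking at the counter list) to see which exceptions, if any, were thrown.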
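The pattern described above could look roughly like the sketch below. To keep it runnable without Hadoop on the classpath, `CounterSink` and `RecordLogic` are hypothetical stand-ins, not Hadoop types. In a real job the sink would be the framework's counter call, for example `context.getCounter("Exception", name).increment(1)` in the new API or `reporter.incrCounter("Exception", name, 1)` in the old one. Falling back to the exception class name when the message is null is my own assumption, not part of the answer.

```java
import java.util.Map;
import java.util.TreeMap;

final class ExceptionCounterSketch {
    // Hypothetical stand-in for Hadoop's counter API so the sketch runs standalone.
    interface CounterSink {
        void increment(String group, String name);
    }

    // Hypothetical per-record logic; in a real job this is the body of map() or reduce().
    interface RecordLogic {
        void process(String record) throws Exception;
    }

    // Runs the logic and counts any exception instead of letting it vanish silently.
    static void guarded(String record, RecordLogic logic, CounterSink counters) {
        try {
            logic.process(record);
        } catch (Exception e) {
            // Counter name is the exception message, as in the answer;
            // the class-name fallback for null messages is an assumption.
            String name = e.getMessage() != null ? e.getMessage() : e.getClass().getName();
            counters.increment("Exception", name);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> tally = new TreeMap<>();
        CounterSink sink = (group, name) -> tally.merge(group + "/" + name, 1L, Long::sum);
        RecordLogic logic = r -> {
            if (r.isEmpty()) {
                throw new IllegalArgumentException("empty record");
            }
        };
        for (String r : new String[] {"ALPHAA", "", "ALZHA", ""}) {
            guarded(r, logic, sink);
        }
        System.out.println(tally); // {Exception/empty record=2}
    }
}
```

One caveat with this design: if exception messages embed record-specific data, every failure creates a new counter name, and Hadoop caps the number of counters per job. Using the exception class name as the counter name keeps the list short when that becomes a problem.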

Source

2013-05-17 13:32:10
