Setting the number of reducers for a MapReduce job in an Oozie workflow

I have a five-node cluster; three of the nodes contain DataNodes and TaskTrackers.
I import roughly 10 million rows from Oracle via Sqoop and process them with a MapReduce job inside an Oozie workflow.
The MapReduce job takes about 30 minutes and uses only a single reducer.
Edit - If I run the MapReduce code on its own, outside of Oozie, job.setNumReduceTasks(4) correctly establishes 4 reducers.
I have tried the following approaches to set the number of reducers to four manually, all without success:
In Oozie, setting the following property in the map-reduce action node:
<property><name>mapred.reduce.tasks</name><value>4</value></property>
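For context, that property has to sit inside the <configuration> block of the map-reduce action. A minimal sketch of such an action follows; the action name, the ${jobTracker}/${nameNode} parameters, and the transitions are placeholders, not taken from the actual workflow:

<action name="mr-process-rows">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.reduce.tasks</name>
                <value>4</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>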
In the main method of the MapReduce Java code:
Configuration conf = new Configuration();
Job job = new Job(conf, "10 million rows");
...
job.setNumReduceTasks(4);
I also tried:
Configuration conf = new Configuration();
Job job = new Job(conf, "10 million rows");
...
conf.set("mapred.reduce.tasks", "4");
My map function looks something like this:
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // Re-key each record by the id parsed out of the input key
    CustomObj customObj = new CustomObj(key.toString());
    context.write(new Text(customObj.getId()), customObj);
}
I expect around 80,000 distinct values of that id.
My reduce function looks something like this:
public void reduce(Text key, Iterable<CustomObj> vals, Context context)
        throws IOException, InterruptedException {
    OtherCustomObj otherCustomObj = new OtherCustomObj();
    ...
    context.write(null, otherCustomObj);
}
The custom object emitted by the mapper implements WritableComparable, but the other custom object emitted by the reducer does not implement WritableComparable.
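For reference, a sketch of what a WritableComparable like CustomObj could look like; only the id field and the getId() accessor appear in the question, so everything else here is assumed:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class CustomObj implements WritableComparable<CustomObj> {
    private String id;

    // The framework needs a no-arg constructor to deserialize instances.
    public CustomObj() { }

    public CustomObj(String raw) {
        this.id = raw; // the real parsing of the input record is not shown
    }

    public String getId() { return id; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
    }

    @Override
    public int compareTo(CustomObj other) {
        return id.compareTo(other.id);
    }
}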
Below are the file system counters, job counters, and map-reduce framework sections of the log, which show that only one reduce task was launched.
map 100% reduce 100%
Job complete: job_201401131546_0425
Counters: 32
File System Counters
FILE: Number of bytes read=1370377216
FILE: Number of bytes written=2057213222
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=556345690
HDFS: Number of bytes written=166938092
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=1
Job Counters
Launched map tasks=11
Launched reduce tasks=1
Data-local map tasks=11
Total time spent by all maps in occupied slots (ms)=1268296
Total time spent by all reduces in occupied slots (ms)=709774
Total time spent by all maps waiting after reserving slots (ms)=0
Total time spent by all reduces waiting after reserving slots (ms)=0
Map-Reduce Framework
Map input records=9440000
Map output records=9440000
Map output bytes=666308476
Input split bytes=1422
Combine input records=0
Combine output records=0
Reduce input groups=80000
Reduce shuffle bytes=685188530
Reduce input records=9440000
Reduce output records=2612760
Spilled Records=28320000
CPU time spent (ms)=1849500
Physical memory (bytes) snapshot=3581157376
Virtual memory (bytes) snapshot=15008251904
Total committed heap usage (bytes)=2848063488
Edit: I modified the MapReduce job to introduce a custom partitioner, a sort comparator, and a grouping comparator. For some reason, the code now launches two reducers (when scheduled through Oozie), but still not four.
I set the mapred.tasktracker.map.tasks.maximum property to 20 on each TaskTracker (and on the JobTracker) and restarted them, but to no avail.
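As an aside, that property only governs map slots; in MRv1 the reduce slots have a separate key, mapred.tasktracker.reduce.tasks.maximum. A mapred-site.xml fragment covering both (the map-side value is the one from above; the reduce-side value is illustrative):

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>20</value>
</property>
<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
</property>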
I manually set the custom partitioner to a value of 4: the implementing method separates the ids into 4 sections based on certain conditions. This was just to test whether 4 partitions/reducers would actually execute. –
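A sketch of the kind of hard-coded four-way partitioner that comment describes; the actual splitting condition on the ids is not given in the thread, so a hash-based split stands in for it:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FourWayPartitioner extends Partitioner<Text, CustomObj> {
    @Override
    public int getPartition(Text key, CustomObj value, int numPartitions) {
        // Hard-coded to 4 sections for the test; the real condition on
        // the ids is not shown in the thread.
        return (key.hashCode() & Integer.MAX_VALUE) % 4;
    }
}

It would be registered in the driver with job.setPartitionerClass(FourWayPartitioner.class).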
Which version of Hadoop are you using? Check whether the property you are using to set the reducers is valid for that version. –