Reading many files from the Hadoop MapReduce distributed cache

I have a set of files, say 10 files, and one large file that is the sum of all 10 files.



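  1. In the reduce method I read only the selected files that were added to the distributed cache. I expected this to be faster, because the file read in each reduce is smaller than the large file that would otherwise be read in every reduce method. However, it was slower.

  2. Also, when I split the data into even smaller files and added them to the distributed cache, the problem got worse. The job itself took a long time before it even started running.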





You could write something like the following. Using the new MapReduce API (org.apache.hadoop.mapreduce.*):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class ReduceJob extends Reducer<Text, Text, Text, Text> {
    Path file1;
    Path file2;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Get the local copies of the files from the distributed cache.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        file1 = cacheFiles[0];
        file2 = cacheFiles[1];

        // setup() runs once per task: parse the files here and keep their data
        // in memory for use in reduce(), e.g. in an ArrayList or HashMap.
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Use the in-memory data loaded in setup().
    }
}

Using the old mapred API (org.apache.hadoop.mapred.*):

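import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    Path file1;
    Path file2;

    @Override
    public void configure(JobConf job) {
        try {
            // Get the local copies of the files from the distributed cache.
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
            file1 = cacheFiles[0];
            file2 = cacheFiles[1];
            // Parse the files here and keep their data in memory for use in reduce().
        } catch (IOException e) {
            throw new RuntimeException("Could not read distributed cache files", e);
        }
    }

    @Override
    public synchronized void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Use the in-memory data loaded in configure().
    }
}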
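
Both variants leave the "parse the file and keep its data in memory" step as a comment. Here is a minimal sketch of what that step might look like in plain Java, assuming each cached file holds one tab-separated key/value pair per line. The file format, the class name `CacheFileLoader`, and the method `loadLookup` are all hypothetical and not part of Hadoop or of the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CacheFileLoader {

    // Hypothetical helper: reads a local file of "key<TAB>value" lines into a map.
    // The tab-separated layout is an assumption made for this sketch.
    public static Map<String, String> loadLookup(String localPath) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that have no tab separator
                }
                // Split on the first tab only; the value may itself contain tabs.
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }
}
```

In setup() or configure() you could then call something like `CacheFileLoader.loadLookup(file1.toString())` once and store the result in a field, so that reduce() only does map lookups. This assumes the local cache path is a plain filesystem path without a URI scheme.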