Using two mappers on the same file at the same time in Hadoop

Suppose there is one file, and two different, independent mappers are to run on it in parallel. To do that, we would need to use a copy of the file.

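What I want to know is whether the same file can be used by both mappers, which in turn would reduce resource usage and make the system more time-efficient.

Is there any research in this area, or any existing tool in Hadoop, that can help with this?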

2013-05-27 Anil Kumar

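Not sure what you are asking. A large file is split into multiple blocks, and each block is processed by a different mapper anyway. Can you clarify your question? – jkovacs

What is the output of your two different, independent mappers? If the types are the same, it is easy to pack the two mappers into one. – zsxwing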

Assuming both mappers have the same K,V signature, you can use a delegating mapper that calls the map methods of your two mappers:

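import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// MyMapper1 and MyMapper2 are your own Mapper<LongWritable, Text, Text, Text> subclasses.
// They must override setup, map and cleanup as public so this class can call them.
public class DelegatingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MyMapper1 mapper1;
    private MyMapper2 mapper2;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mapper1 = new MyMapper1();
        mapper1.setup(context);

        mapper2 = new MyMapper2();
        mapper2.setup(context);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input record is read once and handed to both mappers.
        mapper1.map(key, value, context);
        mapper2.map(key, value, context);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mapper1.cleanup(context);
        mapper2.cleanup(context);
    }
}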
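
To make the data flow easier to see without a cluster, here is a minimal plain-Java sketch of the same delegating idea. It uses no Hadoop classes at all: LineMapper, WORD_MAPPER, LENGTH_MAPPER and run are hypothetical names invented for this illustration, and the two mapper bodies are placeholders for whatever logic your real mappers contain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

// Plain-Java stand-in for the delegating idea; no Hadoop classes are used.
final class OnePassTwoMappers {

    // Hypothetical stand-in for a mapper: takes one input line and emits (key, value) pairs.
    interface LineMapper {
        void map(String line, BiConsumer<String, String> emit);
    }

    // Hypothetical mapper 1: emits (word, "1") for every whitespace-separated token.
    static final LineMapper WORD_MAPPER = (line, emit) -> {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emit.accept(word, "1");
            }
        }
    };

    // Hypothetical mapper 2: emits ("length", <character count>) once per line.
    static final LineMapper LENGTH_MAPPER = (line, emit) ->
            emit.accept("length", String.valueOf(line.length()));

    // Reads each record once and hands it to every mapper in turn.
    static List<String> run(List<String> lines, LineMapper... mappers) {
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            for (LineMapper mapper : mappers) {
                mapper.map(line, (key, value) -> output.add(key + "\t" + value));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> out = run(Arrays.asList("a b", "c"), WORD_MAPPER, LENGTH_MAPPER);
        out.forEach(System.out::println);
    }
}
```

Note that the outputs of both mappers end up in one stream. If the downstream step needs to tell them apart, one option is to have each mapper tag its keys or values.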


2013-05-27 11:04:47

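At a high level, I can imagine two scenarios for the question at hand.

Case 1:

If you are trying to write the SAME implementation in both mapper classes to process the same input file, purely to use resources efficiently, this is probably not the right approach. When a file is saved in the cluster, it is divided into blocks that are replicated across data nodes. That already gives you the most efficient resource utilization, because all the data blocks of the same input file are processed in PARALLEL.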
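
As a rough illustration of how much parallelism the block layout already provides, the sketch below estimates the number of map tasks for a single file. It assumes one input split per block and a 128 MiB block size, both of which are assumptions rather than values taken from the question; the real split computation in Hadoop depends on the input format and configuration, so treat the result as an approximation. SplitEstimate and estimateSplits are hypothetical names.

```java
// Back-of-the-envelope estimate only; not how Hadoop computes splits internally.
final class SplitEstimate {

    // Approximate number of splits, assuming one split per block.
    static long estimateSplits(long fileBytes, long blockBytes) {
        if (fileBytes <= 0) {
            return 0;
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // A 1 GiB file with an assumed 128 MiB block size.
        System.out.println(estimateSplits(1024 * mib, 128 * mib));
    }
}
```

Under these assumptions, a 1 GiB file is already handled by about eight map tasks running side by side.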

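Case 2:

If you are trying to write two DIFFERENT mapper implementations, each with its own business logic, for a specific workflow driven by your business requirements, then yes: you can pass the same input file to two different mappers using the MultipleInputs class.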

MultipleInputs.addInputPath(job, file1, TextInputFormat.class, Mapper1.class); 
MultipleInputs.addInputPath(job, file1, TextInputFormat.class, Mapper2.class);

This may only be a workaround, depending on what you want to achieve.

Thanks.


2016-05-20 15:26:27 naveenkumarbv
