Best way to use the distributed cache to distribute small lookup files

Which is the best way to fetch distributed cache data?

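Code 1:

// Imports shared by both versions.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Lookup entries, loaded once per map task in setup().
    private final ArrayList<String> globalFreq = new ArrayList<String>();

    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        // try-with-resources closes the reader once loading finishes.
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Access "globalFreq" here and do further processing.
    }
}

Code 2:

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private URI[] cacheFiles;
    private FileSystem fs; // kept as a field so that map() can open the cached file

    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        cacheFiles = DistributedCache.getCacheFiles(conf);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The cached file is opened and parsed again for every input record.
        ArrayList<String> globalFreq = new ArrayList<String>();
        Path getPath = new Path(cacheFiles[0].getPath());
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }
}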

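So if we do it as in code 2: say we have 5 map tasks, and every map task reads the same copy of the data. Written this way, each map task reads the data multiple times, am I right (5 times)? Code 1: because the read is written in setup, the data is read only once and the global data is then accessed in map.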

Which is the right way to write distributed cache code?


2014-09-10 Unmesha SreeVeni

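Do as much as possible in the setup method: it is called once per mapper, and whatever it caches is then available for every record passed to that mapper. Parsing the data for each record is an avoidable overhead, because nothing in it depends on the key, value, and context variables you receive in the map method.

The setup method is called once per map task, but map is called for every record processed by that task (which can obviously be a very large number).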
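
To make the difference concrete, here is a minimal plain-Java simulation of the two call patterns. It uses no Hadoop classes at all: `CacheLifecycleDemo`, `LOOKUP`, `parseLookup` and the two counting methods are hypothetical names made up for this sketch, and the loops only imitate "one setup call per task, one map call per record" as described above. It is not how the framework itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the mapper lifecycle; no Hadoop classes are used.
class CacheLifecycleDemo {

    // Hypothetical stand-in for the cached lookup file: one entry per line.
    static final String LOOKUP = "apple 3\nbanana 5\ncherry 2\n";

    // Counts how many times the lookup content has been parsed.
    static int parseCount = 0;

    // Parses the lookup content, keeping the first space-separated token of each line.
    static List<String> parseLookup() {
        parseCount++;
        List<String> entries = new ArrayList<>();
        for (String line : LOOKUP.split("\n")) {
            entries.add(line.split(" ")[0]);
        }
        return entries;
    }

    // Code 1 style: parse once per task (as in setup), then reuse for every record.
    static int parsesWhenLoadedInSetup(int tasks, int recordsPerTask) {
        parseCount = 0;
        for (int t = 0; t < tasks; t++) {
            List<String> globalFreq = parseLookup();      // simulated setup()
            for (int r = 0; r < recordsPerTask; r++) {
                globalFreq.contains("banana");            // simulated map(): read only
            }
        }
        return parseCount;
    }

    // Code 2 style: parse inside map, so once per record.
    static int parsesWhenLoadedInMap(int tasks, int recordsPerTask) {
        parseCount = 0;
        for (int t = 0; t < tasks; t++) {
            for (int r = 0; r < recordsPerTask; r++) {
                List<String> globalFreq = parseLookup();  // simulated map(): re-parses
                globalFreq.contains("banana");
            }
        }
        return parseCount;
    }

    public static void main(String[] args) {
        System.out.println("loaded in setup: " + parsesWhenLoadedInSetup(5, 1000));
        System.out.println("loaded in map:   " + parsesWhenLoadedInMap(5, 1000));
    }
}
```

With 5 simulated tasks of 1000 records each, the setup-style variant parses the lookup 5 times and the map-style variant 5000 times, so the cost of the second pattern grows with the number of records rather than the number of tasks.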


2014-09-10 08:39:51 davek

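So it is better to go with code 1, am I right? And is the second one the straightforward way? I ask because the DistributedCache documentation says "each node will access a copy of the data" https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html – 2014-09-10 08:48:00

I would definitely go with the first option: you cannot avoid each task having to parse the cache contents once, but once that is done you avoid redoing it for every record. – davek 2014-09-10 08:48:32

What happens if the cached data is too large to be stored in a list or similar structure? There may be cases where a large amount of data has to be fetched. For example (please correct me if I am wrong) the KNN algorithm: its model is the same as the input data. At prediction time we need to fetch the model data, and in such cases we cannot rely on code 1, because it may run out of heap space – 2014-09-10 08:54:19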
