Sampling records in a Hadoop Mapper

I have a dataset whose key consists of three parts: a, b, and c. In my mapper, I want to emit records whose key is 'a' and whose value is 'a,b,c'.

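How can I emit 10% of the total records seen for each 'a' from a mapper in Hadoop? Should I consider saving the total number of records seen for each 'a' by a previous Map-Reduce job in a temporary file?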

2013-06-19 syker

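Does it need to be exactly 10%, or approximately 10%? – climbage

The former, but I'd like to hear answers for both. For the latter, I'm guessing it would be something like reservoir sampling? – syker

Yes, that's what I was thinking. Otherwise, you may need to count the number of records per key in the map phase and then use that count in the reduce phase to emit only 10%. Just a thought. – climbage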
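For the exact-10% case, the "count first, then emit 10% in the reduce phase" idea from the comments could look roughly like the sketch below. It is plain Java with no Hadoop dependency and is not code from the thread: it assumes the per-key total is already known (for example from a previous counting job), and it uses single-pass selection sampling rather than reservoir sampling. The class name `ExactFractionSampler`, its methods, and the round-half-up rule in `tenPercent` are hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Hadoop: picks an exact number of values
// from a single pass over the values of one key.
final class ExactFractionSampler {

    // Assumed rounding rule: 10% of total, rounded half up, in integer arithmetic.
    static long tenPercent(long total) {
        return (total + 5) / 10;
    }

    // Selection sampling: requires that 'values' yields exactly 'total' items
    // and that sampleSize <= total.
    static <T> List<T> sample(Iterable<T> values, long total, long sampleSize, Random rng) {
        List<T> picked = new ArrayList<>();
        long seen = 0;
        for (T v : values) {
            long remaining = total - seen;             // items not yet decided, including v
            long needed = sampleSize - picked.size();  // slots still to fill
            if (needed <= 0) {
                break;                                 // sample is complete
            }
            // Keep v with probability needed / remaining. When needed == remaining,
            // the test always passes, so the sample is always filled exactly.
            if (rng.nextDouble() * remaining < needed) {
                picked.add(v);
            }
            seen++;
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> valuesForOneKey = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            valuesForOneKey.add("a,b" + i + ",c" + i);
        }
        long k = tenPercent(valuesForOneKey.size());
        List<String> out = sample(valuesForOneKey, valuesForOneKey.size(), k, new Random());
        System.out.println("kept " + out.size() + " of " + valuesForOneKey.size());
    }
}
```

In a real job, the same loop would run inside a reducer over the values of one 'a', with the total looked up from the output of the counting job. Only the picked values are held in memory, which is 10% of the key's records.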

If approximately 10% is enough, you can use Random. Here is an example mapper:

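import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Test extends Mapper<LongWritable, Text, LongWritable, Text> {

    private final Random r = new Random();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // nextInt(10) is uniform over 0..9, so each record is kept with probability 1/10.
        if (r.nextInt(10) == 0) {
            context.write(key, value);
        }
    }

}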
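The mapper above emits the input offset as the key, while the question asks for 'a' as the key and 'a,b,c' as the value. A minimal plain-Java sketch of that per-record logic follows. It assumes the three key parts are joined by commas, which the question does not state, and `ApproxSampler` and `sampleRecord` are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Random;

// Hypothetical helper showing the per-record decision a mapper would make.
final class ApproxSampler {

    // Returns (a, "a,b,c") for roughly one record in ten, or null when the record is skipped.
    // Assumption: the composite key is comma-separated.
    static Map.Entry<String, String> sampleRecord(String compositeKey, Random rng) {
        if (rng.nextInt(10) != 0) {
            return null;                               // skipped about 90% of the time
        }
        String a = compositeKey.split(",", 2)[0];      // first part of the key
        return new SimpleEntry<>(a, compositeKey);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int kept = 0;
        int total = 100000;
        for (int i = 0; i < total; i++) {
            if (sampleRecord("a" + (i % 5) + ",b" + i + ",c" + i, rng) != null) {
                kept++;
            }
        }
        // The kept fraction is close to, but generally not exactly, 10%.
        System.out.println("kept " + kept + " of " + total);
    }
}
```

Because each record is decided independently, the fraction kept for a given 'a' only approaches 10% when that 'a' has many records. For keys with few records the deviation can be large.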


2013-06-20 01:31:09 zsxwing

Use this Java code to randomly select about 10% of the input lines:

import java.io.IOException; 
import java.util.*; 

import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 

public class RandomSample {

    public static class Map extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Keep each input line with probability 0.1; emit nothing otherwise.
            if (Math.random() < 0.1) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "randomsample");
        job.setJarByClass(RandomSample.class);

        // Without this call the default identity mapper runs and nothing is sampled.
        job.setMapperClass(Map.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Map-only job: mapper output is written directly to the output path.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

Then use this bash script to run it:

echo "Running Job" 
hadoop jar RandomSample.jar RandomSample "$1" tmp
echo "copying result to local path (RandomSample)" 
hadoop fs -getmerge tmp RandomSample 
echo "Clean up" 
hadoop fs -rmr tmp

For example, if we name the script random_sample.sh, then to select 10% of the folder /example/, simply run:

./random_sample.sh /example/


2014-12-24 11:33:01
