2013-06-19

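Sampling records in a Hadoop Mapper

I have a dataset whose key consists of three parts: a, b, and c. In my mapper, I want to emit records whose key is 'a' and whose value is 'a,b,c'.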

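How can I emit, from a mapper in Hadoop, 10% of the total number of records seen for each 'a'? Should I consider having a previous Map-Reduce job save the total number of records seen for each 'a' in a temporary file?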

Does it need to be exactly 10%, or approximately 10%? – climbage

The former, but I would like to hear answers for both. For the latter, I am guessing it would be something like reservoir sampling? – syker

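Yes, that is what I was thinking. Otherwise, you may need to count the number of records per key in the map phase and then use that count to emit only 10% in the reduce phase. Just a thought. – climbage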
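
The exact-10% idea from the comments can be sketched in plain Java, with no Hadoop dependency. This is a minimal sketch and not the commenters' code: it assumes the record count for a key is already known (for example from an earlier counting job, as the question suggests), and the names `ExactFractionSampler`, `sampleSize`, and `reservoirSample` are hypothetical. It uses reservoir sampling (Algorithm R) to keep exactly k records from a stream in a single pass.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Hadoop: picks exactly k items uniformly
// at random from a stream, e.g. the values a reducer sees for one key 'a'.
final class ExactFractionSampler {

    // Number of records to keep for a key that has `total` records.
    // Rounding to nearest is an assumption; floor or ceil may suit better.
    static int sampleSize(long total, double fraction) {
        return (int) Math.round(total * fraction);
    }

    // Reservoir sampling (Algorithm R): one pass over the stream, O(k) memory.
    static <T> List<T> reservoirSample(Iterator<T> stream, int k, Random rng) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k items fill the reservoir directly.
                reservoir.add(item);
            } else {
                // Item number `seen` replaces a random slot with probability k / seen.
                int j = rng.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            records.add("a,b" + i + ",c" + i);
        }
        int k = sampleSize(records.size(), 0.10);
        List<String> sample = reservoirSample(records.iterator(), k, new Random());
        System.out.println(sample.size() + " of " + records.size() + " records kept");
    }
}
```

In an actual reducer, each value would need to be copied before it is stored in the reservoir, because Hadoop reuses the value object while iterating over a key's values.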

Answers

If approximately 10% is good enough, you can use a random number generator. Here is an example mapper:

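import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Test extends Mapper<LongWritable, Text, LongWritable, Text> {

    // One generator per mapper instance; each map task makes its own draws.
    private final Random r = new Random();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // nextInt(10) is uniform over 0..9, so each record passes with probability 1/10.
        if (r.nextInt(10) == 0) {
            context.write(key, value);
        }
    }

}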
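
To see why this approach gives only approximately 10%, and how the key 'a' from the question could be pulled out of each record, here is a small stand-alone sketch. It is plain Java with no Hadoop classes; the comma delimiter, the class name `BernoulliSampleDemo`, and the method `sampleRecord` are assumptions made for illustration and are not part of the answer above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Random;

// Hypothetical demo of per-record (Bernoulli) sampling outside Hadoop.
final class BernoulliSampleDemo {

    // Returns the pair (a, "a,b,c") with probability `rate`, otherwise null.
    // Assumes the record is comma-separated and 'a' is the first field.
    static Map.Entry<String, String> sampleRecord(String line, double rate, Random rng) {
        if (rng.nextDouble() >= rate) {
            return null; // record is dropped
        }
        int comma = line.indexOf(',');
        String a = comma < 0 ? line : line.substring(0, comma);
        return new SimpleEntry<>(a, line);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int total = 100_000;
        int kept = 0;
        for (int i = 0; i < total; i++) {
            if (sampleRecord("a" + (i % 5) + ",b,c", 0.10, rng) != null) {
                kept++;
            }
        }
        // Each record is an independent coin flip, so the kept fraction
        // varies around 10% from run to run instead of being exactly 10%.
        System.out.printf("kept %d of %d (%.2f%%)%n", kept, total, 100.0 * kept / total);
    }
}
```

Because every record is decided independently, the number kept for any single 'a' also fluctuates around 10% of that key's records, with larger relative swings for keys that have few records.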

Use this Java code to randomly select about 10% of the records:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RandomSample {

    public static class Map extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Keep each input line with probability 0.1; drop it otherwise.
            // The line is written as the key with an empty (NullWritable) value,
            // so TextOutputFormat outputs just the line.
            if (Math.random() < 0.1) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // On Hadoop 2+, Job.getInstance(conf, "randomsample") is the non-deprecated equivalent.
        Job job = new Job(conf, "randomsample");
        job.setJarByClass(RandomSample.class);
        job.setMapperClass(Map.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Map-only job: sampled records go straight to the output files.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

and use this bash script to run it:

echo "Running Job"
hadoop jar RandomSample.jar RandomSample "$1" tmp
echo "Copying result to local path (RandomSample)"
hadoop fs -getmerge tmp RandomSample
echo "Clean up"
# On Hadoop 2+, -rmr is deprecated in favor of: hadoop fs -rm -r tmp
hadoop fs -rmr tmp

For example, if we name the script random_sample.sh, then to select 10% of the records from the folder /example/, just run:

./random_sample.sh /example/