0
我們有一個爲文件中的每一行生成唯一鍵的場景。我們有一個timestamp列,但是在少數情況下可以有多個行可用於同一個時間戳。減速器獲得的記錄數比預期的少
如下面的程序中所提到的,我們決定使用它們各自的計數附加時間戳的唯一值。
映射器將只發出時間戳記作爲鍵和整個行作爲其值,並在縮減器中生成密鑰。
問題是地圖輸出大約236行,其中只有230條記錄作爲輸入輸出的減速器輸出相同的230條記錄。
public class UniqueKeyGenerator extends Configured implements Tool {
private static final String SEPERATOR = "\t";
private static final int TIME_INDEX = 10;
private static final String COUNT_FORMAT_DIGITS = "%010d";
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
@Override
protected void map(LongWritable key, Text row, Context context)
throws IOException, InterruptedException {
String input = row.toString();
String[] vals = input.split(SEPERATOR);
if (vals != null && vals.length >= TIME_INDEX) {
context.write(new Text(vals[TIME_INDEX - 1]), row);
}
}
}
public static class Reduce extends Reducer<Text, Text, NullWritable, Text> {
@Override
protected void reduce(Text eventTimeKey,
Iterable<Text> timeGroupedRows, Context context)
throws IOException, InterruptedException {
int cnt = 1;
final String eventTime = eventTimeKey.toString();
for (Text val : timeGroupedRows) {
final String res = SEPERATOR.concat(getDate(
Long.valueOf(eventTime)).concat(
String.format(COUNT_FORMAT_DIGITS, cnt)));
val.append(res.getBytes(), 0, res.length());
cnt++;
context.write(NullWritable.get(), val);
}
}
}
public static String getDate(long time) {
SimpleDateFormat utcSdf = new SimpleDateFormat("yyyyMMddhhmmss");
utcSdf.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
return utcSdf.format(new Date(time));
}
public int run(String[] args) throws Exception {
conf(args);
return 0;
}
public static void main(String[] args) throws Exception {
conf(args);
}
private static void conf(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "uniquekeygen");
job.setJarByClass(UniqueKeyGenerator.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// job.setNumReduceTasks(400);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
這對於更高的no行是一致的,並且差異像輸入20855982行的208969記錄一樣巨大。減速機投入減少的原因是什麼?
您如何知道從地圖上寫入的記錄數量?計數器? – climbage
從運行MR之後發出的最終成功日誌中,Mapper輸出的no爲236,reducer輸入的no爲230 – sathishs