2012-09-03 57 views
0

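Creating a sequence file format for Hadoop MR

I am working with Hadoop MapReduce and have a question. Currently, my mapper's input KV type is LongWritable, LongWritable, and its output KV type is also LongWritable, LongWritable. The input format is SequenceFileInputFormat. Basically, what I want to do is convert a txt file into a sequence file so that I can use it as input for my mapper.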

Here is what I am trying to do.

The input file looks like this:

1\t2 (key = 1, value = 2)

2\t3 (key = 2, value = 3)

and so on...

I looked at this thread, How to convert .txt file to Hadoop's sequence file format, but realized that TextInputFormat only supports Key = LongWritable and Value = Text.

Is there any way to read the txt file and create a sequence file with KV = LongWritable, LongWritable?

Answers

7

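Sure, basically the same way I described in the other thread you linked. But you have to implement your own Mapper.

Here is a quick sketch to get you started: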

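import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class LongLongMapper extends
        Mapper<LongWritable, Text, LongWritable, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // the incoming key is the byte offset of the line and is ignored;
        // assuming that your line contains key and value separated by \t
        String[] split = value.toString().split("\t");

        context.write(new LongWritable(Long.parseLong(split[0])),
                new LongWritable(Long.parseLong(split[1])));
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Configuration conf = new Configuration();
        // deprecated in Hadoop 2+, where Job.getInstance(conf) is preferred
        Job job = new Job(conf);
        job.setJobName("Convert Text");
        job.setJarByClass(LongLongMapper.class);

        // use the custom mapper; the base Mapper class would pass Text values through unchanged
        job.setMapperClass(LongLongMapper.class);

        // map-only job, so no reducer class is set;
        // increase if you need sorting or a special number of files
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        // submit and wait for completion; exit non-zero on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}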

In your map function, each value is one line of your input, so we just split it on your delimiter (tab) and parse each part into a long.

That's it.
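If some lines in your input might be malformed, the split-and-parse step can be tried on its own without a cluster. The following is only a minimal sketch in plain Java with no Hadoop dependency; the class name TabLineParser and its parse method are hypothetical names made up for illustration, and skipping bad lines (rather than failing the task) is an assumption about what you want.

```java
import java.util.Optional;

// Hypothetical helper, not part of Hadoop: mirrors the split-and-parse step of the mapper.
final class TabLineParser {

    // Parses "key<TAB>value" into {key, value}; returns empty for malformed lines.
    static Optional<long[]> parse(String line) {
        String[] parts = line.split("\t");
        if (parts.length != 2) {
            return Optional.empty(); // missing tab, empty field, or too many fields
        }
        try {
            long key = Long.parseLong(parts[0].trim());
            long value = Long.parseLong(parts[1].trim());
            return Optional.of(new long[] { key, value });
        } catch (NumberFormatException e) {
            return Optional.empty(); // a field is not a valid long
        }
    }

    public static void main(String[] args) {
        String[] lines = { "1\t2", "2\t3", "bad line" };
        for (String line : lines) {
            Optional<long[]> kv = parse(line);
            if (kv.isPresent()) {
                System.out.println("key=" + kv.get()[0] + " value=" + kv.get()[1]);
            } else {
                System.out.println("skipped: " + line);
            }
        }
    }
}
```

Inside the mapper you would call such a helper on value.toString() and only call context.write when a pair comes back, so a single bad line does not make the whole task fail with a NumberFormatException.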

+0

Thank you, I got a lot of ideas from the skeleton and was able to create a seq. file writer. – user1566629

+0

If you have another example, please share it so I can understand this better, email id [email protected], thanks in advance –

+0

And please tell me what the input and output format of the reducer class would be, I mean the key and value types used for input and output –