2013-11-26

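I am facing the following situation, please help me. I am using Hadoop MapReduce to process XML files, as a single input format in Hadoop.

By referring to this site I am able to split my records: https://gist.github.com/sritchie/808035. But when the XML file is larger than the block size I do not get the proper values, so I need to read the whole file. For that I found this link: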

https://github.com/pyongjoo/MapReduce-Example/blob/master/mysrc/XmlInputFormat.java

But now the problem is how to implement the two input formats as a single input format.

Please help me soon. Thanks.

UPDATE

public class XmlParser11
{

    public static class XmlInputFormat1 extends TextInputFormat {

        public static final String START_TAG_KEY = "xmlinput.start";
        public static final String END_TAG_KEY = "xmlinput.end";

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
            return new XmlRecordReader();
        }

        /**
         * XMLRecordReader class to read through a given xml document to output
         * xml blocks as records as specified by the start tag and end tag
         */
        public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
            private byte[] startTag;
            private byte[] endTag;
            private long start;
            private long end;
            private FSDataInputStream fsin;
            private DataOutputBuffer buffer = new DataOutputBuffer();

            private LongWritable key = new LongWritable();
            private Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                Configuration conf = context.getConfiguration();
                startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
                endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
                FileSplit fileSplit = (FileSplit) split;

But it does not work.

Answers

1

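Override the isSplitable method to tell Hadoop not to split the file, even when it is larger than the block size. This is typically used when you want to make sure that a large file is processed by a single mapper.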

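import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class XmlInputFormat extends FileInputFormat<LongWritable, Text> {
    // Never split: each file goes to exactly one mapper, whatever its size
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // return your version of XML record reader, e.g. the XmlRecordReader from the question
        return new XmlRecordReader();
    }
}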
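To make the "your version of XML record reader" part more concrete, here is a minimal sketch of the tag-scanning loop that such a reader typically runs. It is written against plain java.io so it can be tried without a cluster. The class XmlRecordScanner and its readAll helper are hypothetical names made up for this illustration, not Hadoop APIs, and the sketch assumes non-empty tags such as <rec> and </rec>. In a real RecordReader the stream would be the opened file, and with isSplitable returning false the single split spans the whole file, so the same loop sees every record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: pulls <tag>...</tag> blocks out of a stream.
class XmlRecordScanner {
    private final InputStream in;
    private final byte[] startTag;
    private final byte[] endTag;

    XmlRecordScanner(InputStream in, String startTag, String endTag) {
        this.in = in;
        this.startTag = startTag.getBytes(StandardCharsets.UTF_8);
        this.endTag = endTag.getBytes(StandardCharsets.UTF_8);
    }

    // Returns the next record including its tags, or null when no complete record is left.
    String next() throws IOException {
        if (!readUntilMatch(startTag, null)) {
            return null;
        }
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        record.write(startTag);
        if (!readUntilMatch(endTag, record)) {
            return null; // stream ended inside a record
        }
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }

    // Consumes bytes until `match` has been seen; copies them to `sink` when it is non-null.
    private boolean readUntilMatch(byte[] match, ByteArrayOutputStream sink) throws IOException {
        int matched = 0;
        while (true) {
            int b = in.read();
            if (b == -1) {
                return false; // end of stream before the tag was found
            }
            if (sink != null) {
                sink.write(b);
            }
            if (b == (match[matched] & 0xff)) {
                matched++;
            } else {
                // Restart, letting this byte begin a new match. This is enough for
                // tags whose first byte ('<') does not occur again inside the tag.
                matched = (b == (match[0] & 0xff)) ? 1 : 0;
            }
            if (matched == match.length) {
                return true;
            }
        }
    }

    // Convenience wrapper for trying the scanner on an in-memory document.
    static List<String> readAll(String xml, String startTag, String endTag) {
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        XmlRecordScanner scanner = new XmlRecordScanner(in, startTag, endTag);
        List<String> records = new ArrayList<>();
        try {
            for (String r = scanner.next(); r != null; r = scanner.next()) {
                records.add(r);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<root><rec>a</rec><x/><rec>b</rec></root>";
        for (String record : readAll(xml, "<rec>", "</rec>")) {
            System.out.println(record);
        }
    }
}
```

The point of the sketch is that reading the whole file and cutting it into records are not two separate input formats: one input format that refuses to split, plus one reader that scans for the start and end tags, covers both.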

Alternatively, you can increase the split size so that each split covers more of the file:

// Set the maximum split size 
setMaxSplitSize(MAX_INPUT_SPLIT_SIZE); 
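A note on how the split size is chosen, since it affects which setting to change. To my knowledge, the new-API FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)). The sketch below only reproduces that formula in plain Java. SplitSizeSketch and the sizes used are made up for illustration, and the exact setter to call depends on the input format class and Hadoop version (FileInputFormat has static setMinInputSplitSize and setMaxInputSplitSize methods that take the Job).

```java
// Hypothetical illustration of the split-size formula; not Hadoop code.
class SplitSizeSketch {
    // Same shape as the formula described above: the block size, capped by
    // maxSize from above and then raised to at least minSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 128 * mb; // assumed block size for the example

        // Defaults (min = 1 byte, max = Long.MAX_VALUE): split size equals the block size.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) / mb);

        // Raising only the maximum does not grow the split beyond the block size.
        System.out.println(computeSplitSize(block, 1L, 1024 * mb) / mb);

        // Raising the minimum above the block size is what produces larger splits.
        System.out.println(computeSplitSize(block, 1024 * mb, Long.MAX_VALUE) / mb);
    }
}
```

Under that formula a larger maximum alone leaves splits at the block size, so for a file that must stay in one piece, returning false from isSplitable is the more direct option.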
+0

But we need to write the RecordReader properly. I have a RecordReader for the XML reader, so how can I merge the whole-file reader into it? – Backtrack

+0

Please take a look at this post. I have edited it. –

+0

+1. I have updated the post, please take a look. – Backtrack
