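I am trying to override the "next" method of the RecordReader class and the "getRecordReader" method of the TextInputFormat class so that a whole paragraph is sent to the mapper instead of a single line. (I am using the old API, and I define a paragraph as the text up to an empty line in my text file.)
Below is my code, which overrides the RecordReader to read a paragraph at a time instead of a line: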
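import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// NLinesInputFormat.java: hands each split to a ParagraphRecordReader.
public class NLinesInputFormat extends TextInputFormat
{
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}

// ParagraphRecordReader.java (needs the same imports as above):
// wraps a LineRecordReader and joins lines until an empty line is read.
public class ParagraphRecordReader implements RecordReader<LongWritable, Text>
{
    private LineRecordReader lineRecord;
    private LongWritable lineKey;
    private Text lineValue;

    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }

    @Override
    public void close() throws IOException {
        lineRecord.close();
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text("");
    }

    @Override
    public float getProgress() throws IOException {
        // Progress must be a fraction between 0 and 1, not a byte offset,
        // so delegate to the wrapped reader instead of returning getPos().
        return lineRecord.getProgress();
    }

    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended, gotsomething;
        boolean retval;
        byte space[] = {' '};
        value.clear();
        gotsomething = false;
        do {
            appended = false;
            retval = lineRecord.next(lineKey, lineValue);
            if (retval) {
                if (!gotsomething) {
                    // The record key is the byte offset of the first line read in this call.
                    key.set(lineKey.get());
                }
                if (lineValue.getLength() > 0) {
                    // Non-empty line: append it, followed by a single space.
                    byte[] rawline = lineValue.getBytes();
                    int rawlinelen = lineValue.getLength();
                    value.append(rawline, 0, rawlinelen);
                    value.append(space, 0, 1);
                    appended = true;
                }
                gotsomething = true;
            }
        } while (appended); // stop at an empty line or at the end of the split
        //System.out.println("ParagraphRecordReader::next() returns "+gotsomething+" after setting value to: ["+value.toString()+"]");
        return gotsomething;
    }

    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}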
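Questions:
1. I could not find any specific guidance on how to do this, so I may have done it wrong. Any suggestions?
2. It compiles fine, but when I run my job the mapper keeps running, and I cannot figure out where the problem is.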
Have you tried an input that contains only one paragraph? – Amar 2013-03-25 09:02:21
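One way to try the single-paragraph case without a cluster is to run the same do/while logic against an in-memory string. The sketch below is only that: `ParagraphLoopCheck`, `nextRecord` and `records` are hypothetical names, and a `BufferedReader` stands in for the wrapped LineRecordReader, so it says nothing about how Hadoop splits the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone harness; it does not use any Hadoop classes.
final class ParagraphLoopCheck {

    // Mirrors the do/while loop of next(): returns the next record,
    // or null once the input is exhausted.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder value = new StringBuilder();
        boolean gotSomething = false;
        boolean appended;
        do {
            appended = false;
            String line = in.readLine();
            if (line != null) {
                if (!line.isEmpty()) {
                    value.append(line).append(' ');
                    appended = true;
                }
                gotSomething = true;
            }
        } while (appended);
        return gotSomething ? value.toString() : null;
    }

    // Collects every record produced for the given text.
    static List<String> records(String text) {
        List<String> out = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            for (String r = nextRecord(in); r != null; r = nextRecord(in)) {
                out.add(r);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // A single paragraph with no blank line at the end.
        System.out.println(records("first line\nsecond line\n"));
        // Two consecutive blank lines between paragraphs.
        System.out.println(records("a\nb\n\n\nc\n"));
    }
}
```

In this stand-in the loop ends at the end of input, and two consecutive blank lines produce one empty record between the paragraphs, which the mapper would need to tolerate or the reader would need to skip.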
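I think you have a bug: you will get extra paragraphs when you cross split boundaries. I think you need to distinguish a split that starts at 0 from the other splits. In a split that starts at 0, the first line begins a paragraph, but in any other split a leading line of text should not begin a new paragraph. (Normally the reader of the previous split has already read past the split boundary, so if your split begins with lines that continue a paragraph, they will already have been emitted by the previous split.) Am I missing something? – 2017-04-15 22:10:56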
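The boundary rule described in this comment can be sketched in plain Java on an in-memory string. This is a simulation under stated assumptions, not Hadoop code: `SplitParagraphSketch` and `readSplit` are hypothetical names, offsets are character indexes into the string, lines end with "\n", and a paragraph is assumed to belong to the split that contains the start of its first line. A reader for a split that does not start at 0 skips the tail of a paragraph begun earlier, and every reader finishes the paragraph it started even if that runs past its own end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of split handling; it does not use any Hadoop classes.
final class SplitParagraphSketch {

    // Returns the paragraphs whose first line starts inside [start, end).
    static List<String> readSplit(String text, int start, int end) {
        List<String> out = new ArrayList<>();
        int pos = start;
        if (start > 0) {
            // Move to the first line start at or after `start`.
            int nl = text.indexOf('\n', start - 1);
            if (nl < 0) {
                return out; // no line starts inside this split
            }
            pos = nl + 1;
        }
        // Unless the previous line was blank, the lines here continue a
        // paragraph owned by an earlier split and must be skipped.
        boolean foreign = pos > 0 && !(pos == 1 || text.charAt(pos - 2) == '\n');
        StringBuilder current = null;
        while (pos < text.length()) {
            if (current == null && pos >= end) {
                break; // any paragraph starting here belongs to a later split
            }
            int nl = text.indexOf('\n', pos);
            int lineEnd = nl < 0 ? text.length() : nl;
            String line = text.substring(pos, lineEnd);
            if (line.isEmpty()) {
                // A blank line closes the paragraph being collected, if any.
                if (current != null) {
                    out.add(current.toString());
                    current = null;
                }
                foreign = false;
            } else if (current != null) {
                current.append(line).append(' '); // may run past `end`
            } else if (!foreign) {
                current = new StringBuilder(line).append(' ');
            }
            pos = lineEnd + 1;
        }
        if (current != null) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "a\nb\n\nc\nd\n\ne\n";
        // Cut in the middle of the second paragraph.
        System.out.println(readSplit(text, 0, 6));
        System.out.println(readSplit(text, 6, text.length()));
    }
}
```

With this rule, each paragraph is emitted by exactly one of the two simulated splits wherever the cut falls. A real reader built on LineRecordReader would need its own way to read past the end of its split, which this sketch does not address.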