Hadoop MapReduce: custom input format
I have a file in which the text records are separated by the "^" character:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
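I am writing a custom input format that delimits records with the "^" character instead of newlines, i.e. the output of the mapper should look like this: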
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
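I have written a custom input format that extends FileInputFormat, along with a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with it: I am stuck on the while loop in the nextKeyValue() method. How should I read data from the split and generate my custom key-value pairs? I am using the new mapreduce package throughout, not the old mapred package.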
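import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class MyRecordReader extends RecordReader<LongWritable, Text>
{
    long start, current, end;   // byte offsets within the file
    Text value;
    LongWritable key;
    LineReader reader;
    FileSplit split;
    Path path;
    FileSystem fs;
    FSDataInputStream in;
    Configuration conf;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException
    {
        conf = cont.getConfiguration();
        split = (FileSplit) inputSplit;
        path = split.getPath();
        fs = path.getFileSystem(conf);
        in = fs.open(path);
        start = split.getStart();
        current = start;
        end = split.getLength() + start;
        in.seek(start);   // begin reading at this split, not at the start of the file
        reader = new LineReader(in, conf);
    }

    @Override
    public boolean nextKeyValue() throws IOException
    {
        if (key == null)
            key = new LongWritable();
        key.set(current);
        if (value == null)
            value = new Text();
        long readSize = 0;
        while (current < end)
        {
            Text tmpText = new Text();
            // How should I read data from the split here and generate the key-value pair?
            readSize = 0;   // placeholder: nothing is read yet
            if (readSize == 0)
                break;
            current += readSize;
        }
        if (readSize == 0)
        {
            key = null;
            value = null;
            return false;
        }
        return true;
    }

    @Override
    public float getProgress() throws IOException
    {
        // Fraction of the split consumed so far
        if (end == start)
            return 0.0f;
        return Math.min(1.0f, (current - start) / (float) (end - start));
    }

    @Override
    public LongWritable getCurrentKey() throws IOException
    {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException
    {
        return value;
    }

    @Override
    public void close() throws IOException
    {
        if (reader != null)
            reader.close();   // also closes the underlying stream
    }
}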
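To make the missing read step concrete, here is a minimal sketch of the logic nextKeyValue() needs, written without any Hadoop classes so it can be run on its own. Everything in it is an assumption for illustration: the class name CaretRecordSketch, the next() method, and the choice to flatten embedded newlines into spaces and trim the result are hypothetical, not part of Hadoop. It reads bytes up to the next "^", uses the byte offset of the record start as the key, and reports end of input when nothing was read, which mirrors the readSize == 0 check.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the record reader; no Hadoop classes are used.
class CaretRecordSketch {
    private static final int DELIMITER = '^';

    private final InputStream in;
    private long pos = 0; // bytes consumed so far (plays the role of `current`)
    long key;             // byte offset at which the current record starts
    String value;         // current record, newlines flattened to spaces

    CaretRecordSketch(InputStream in) {
        this.in = in;
    }

    // Advances to the next record; returns false once the input is exhausted.
    boolean next() throws IOException {
        long recordStart = pos;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (b == DELIMITER) {
                break; // the delimiter is consumed but not stored
            }
            buf.write(b);
        }
        if (pos == recordStart) {
            return false; // nothing was read: same idea as readSize == 0
        }
        key = recordStart;
        String raw = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        // Assumption: a newline inside a record becomes a space, edges are trimmed.
        value = raw.replaceAll("\\r?\\n", " ").trim();
        return true;
    }

    public static void main(String[] args) throws IOException {
        String data = "SOME TEXT^GOES HERE^\nAND A FEW^MORE\nGOES HERE\n";
        CaretRecordSketch reader = new CaretRecordSketch(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8)));
        while (reader.next()) {
            System.out.println(reader.key + "\t" + reader.value);
        }
    }
}
```

In a real RecordReader the byte loop would probably be replaced by Hadoop's LineReader. As far as I know it has a constructor that takes the record delimiter bytes, and its readLine(Text) returns the number of bytes consumed, which is what current should advance by. Check both against the Hadoop version in use. This sketch also ignores records that straddle a split boundary, which a real reader has to handle.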