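Remove entire sentences that contain a specific word using MapReduce
I am learning MapReduce. I want to read an input file sentence by sentence and write each sentence to the output file only if it does not contain the word "snake".
E.g. input file: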
This is my first sentence. This is my first sentence.
This is my first sentence.
The snake is an animal. This is the second sentence. This is my third sentence.
Another sentence. Another sentence with snake.
Then the output file should be:
This is my first sentence. This is my first sentence.
This is my first sentence.
This is the second sentence. This is my third sentence.
Another sentence.
To do this, I check in the map method whether the sentence (value) contains the word snake. If it does not, I write the sentence to the context.
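In addition, I set the number of reducer tasks to 0, because otherwise the sentences appear in the output file in random order (e.g. the first sentence, then the third, then the second, and so on).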
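My code does correctly filter out the sentences containing the word snake, but the problem is that it writes every sentence on a new line, like this: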
This is my first sentence.
This is my first sentence.
This is my first sentence.
This is the second sentence.
This is my third sentence.
Another sentence.
.
How can I write a sentence on a new line only if that sentence starts on a new line in the input text? Here is my code:
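import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoveSentence {
    public static class SentenceMapper extends Mapper<Object, Text, Text, NullWritable> {
        private Text removeWord = new Text("snake");

        // Each call receives one record, i.e. the text between two "." delimiters.
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            if (!value.toString().contains(removeWord.toString())) {
                // Re-append the period that the record delimiter consumed.
                Text currentSentence = new Text(value.toString() + ". ");
                context.write(currentSentence, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split the input into records at every period instead of at every newline.
        conf.set("textinputformat.record.delimiter", ".");
        Job job = Job.getInstance(conf, "remove sentence");
        job.setJarByClass(RemoveSentence.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setMapperClass(SentenceMapper.class);
        // Map-only job: no reducers, so the output keeps the input order.
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}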
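For reference, here is a minimal sketch of the per-line behaviour I am after, written in plain Java without Hadoop so it can be run on its own. It assumes the default newline record delimiter is kept, so that each map call would receive one whole line. The class LineSentenceFilter and the helper filterLine are hypothetical names used only for illustration; they are not part of my job above.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSentenceFilter {

    // Hypothetical helper: returns the line with every sentence containing `word` removed.
    public static String filterLine(String line, String word) {
        List<String> kept = new ArrayList<>();
        // Split right after each period, so every sentence keeps its own period.
        for (String sentence : line.split("(?<=\\.)")) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty() && !trimmed.contains(word)) {
                kept.add(trimmed);
            }
        }
        // Sentences that came from the same input line stay on the same output line.
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String[] input = {
            "This is my first sentence. This is my first sentence.",
            "This is my first sentence.",
            "The snake is an animal. This is the second sentence. This is my third sentence.",
            "Another sentence. Another sentence with snake."
        };
        for (String line : input) {
            String filtered = filterLine(line, "snake");
            // Skip lines where every sentence was removed.
            if (!filtered.isEmpty()) {
                System.out.println(filtered);
            }
        }
    }
}
```

Inside a mapper, the same idea would presumably amount to calling the helper on value.toString() and writing the result to the context only when it is non-empty. Like my original code, the contains check also matches longer words such as "snakes".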
This and this other solution say that it should be enough to call context.write(word, null); but that did not work for me.
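There is one more problem, related to conf.set("textinputformat.record.delimiter", ".");. This is how I define the delimiter between sentences, and because of it some sentences in the output file start with a space (e.g. the second This is my first sentence.). As an alternative, I tried setting it like this: conf.set("textinputformat.record.delimiter", ". "); (with a space after the period), but then the Java application does not write all of the sentences to the output file.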
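A minimal sketch of what may be going on with the two delimiters, assuming the record reader cuts the input at every literal occurrence of the delimiter string. Plain String.split stands in for the record reader here; DelimiterDemo and records are hypothetical names, and this is not Hadoop's actual code.

```java
import java.util.regex.Pattern;

public class DelimiterDemo {

    // Splits the text on a literal delimiter, standing in for the record reader.
    public static String[] records(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // Two input lines; the second one starts with a sentence about the snake.
        String text = "This is my first sentence.\n"
                    + "The snake is an animal. This is the second sentence.\n";
        for (String delimiter : new String[] {".", ". "}) {
            System.out.println("delimiter \"" + delimiter + "\":");
            for (String record : records(text, delimiter)) {
                // Show newlines explicitly so record boundaries are visible.
                System.out.println("  [" + record.replace("\n", "\\n") + "]");
            }
        }
    }
}
```

With "." as the delimiter, a record that follows another sentence on the same line begins with the space that came after the period, and the final record is just the trailing newline. With ". " as the delimiter, a sentence at the end of a line is followed by a newline rather than a space, so it is not cut off there and ends up in the same record as the first sentence of the next line. If that neighbour contains snake, the whole record is dropped, which would explain the missing sentences.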