2016-11-29 67 views

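Remove whole sentences containing a specific word using MapReduce

I am learning MapReduce, and I want to read an input file sentence by sentence and write each sentence to the output file only if it does not contain the word "snake".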

E.g. input file:

This is my first sentence. This is my first sentence. 
This is my first sentence. 

The snake is an animal. This is the second sentence. This is my third sentence. 

Another sentence. Another sentence with snake. 

Then the output file should be:

This is my first sentence. This is my first sentence. 
This is my first sentence. 

This is the second sentence. This is my third sentence. 

Another sentence. 

To do this, in the map method I check whether the sentence (value) contains the word "snake". If it does not, I write the sentence to the context.

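I also set the number of reducer tasks to 0, because otherwise the sentences appear in the output file in random order (e.g. the first sentence, then the third, then the second, and so on).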

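My code correctly filters out the sentences containing the word "snake", but the problem is that it writes every sentence on a new line, like this: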

This is my first sentence. 
This is my first sentence. 

This is my first sentence. 
This is the second sentence. 
This is my third sentence. 


Another sentence. 

. 

How can I write a sentence on a new line only if that sentence starts on a new line in the input text? Here is my code:

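import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoveSentence {

    public static class SentenceMapper extends Mapper<Object, Text, Text, NullWritable> {

        private Text removeWord = new Text("snake");

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // value is the text between two "." delimiters (see main)
            if (!value.toString().contains(removeWord.toString())) {
                // Put back the period that the record delimiter stripped
                Text currentSentence = new Text(value.toString() + ". ");
                context.write(currentSentence, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split the input into records on "." instead of on newlines
        conf.set("textinputformat.record.delimiter", ".");

        Job job = Job.getInstance(conf, "remove sentence");
        job.setJarByClass(RemoveSentence.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setMapperClass(SentenceMapper.class);
        // Map-only job, so the output keeps the input order
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}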

This and this other solution say that it should be enough to call context.write(word, null); but that did not work for me.

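There is one more problem, related to conf.set("textinputformat.record.delimiter", ".");. This is how I define the delimiter between sentences, and because of it some sentences in the output file start with a space (e.g. the second This is my first sentence.). As an alternative, I tried setting it to conf.set("textinputformat.record.delimiter", ". "); (with a space after the period), but then the Java application does not write all of the sentences to the output file.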

Answer


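You are close to solving the problem. Think about how your MapReduce program works. Your map method receives each piece of text delimited by "." (the default delimiter is a newline) as a new value and then writes it to the file, followed by a newline. You would need a property that suppresses the newline written after each call to map(). I am not sure, but I don't think such a property exists.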
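To see why every sentence ends up on its own line, here is a rough plain-Java simulation, without Hadoop, of what the job in the question does. It assumes that the record reader hands map() the text between delimiters with the delimiter stripped, and that the text output ends every record with a newline. The class and method names (DelimiterSimulation, records, simulateJob) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class DelimiterSimulation {

    // Splits text on a literal delimiter, roughly the way a record reader
    // configured with textinputformat.record.delimiter is assumed to.
    public static List<String> records(String text, String delimiter) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int idx = text.indexOf(delimiter, start);
            int end = (idx == -1) ? text.length() : idx;
            out.add(text.substring(start, end));
            start = (idx == -1) ? text.length() : idx + delimiter.length();
        }
        return out;
    }

    // Mimics the mapper from the question, plus an output writer that
    // terminates every written record with '\n'.
    public static String simulateJob(String text) {
        StringBuilder sb = new StringBuilder();
        for (String record : records(text, ".")) {
            if (!record.contains("snake")) {
                sb.append(record).append(". ").append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "First. Second.\nThird snake. Fourth.\n";
        System.out.print(simulateJob(input));
    }
}
```

With this input, every kept sentence lands on its own line, " Second" keeps its leading space, and the newline-only record after the last period becomes a stray ". " line, which matches the symptoms described in the question.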

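One workaround is to let Hadoop process the input as it normally does, with the default newline delimiter, so that each record is a whole line. For example, a record would be: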

This is first sentence. This is second snake. This is last.

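Look for the word "snake"; if it is found, remove everything from the preceding "." up to the next ".", build a new string from what remains, and write it to the context.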
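A minimal sketch of that per-line filtering in plain Java, so it can be tried without a cluster. SentenceFilter and removeSentencesContaining are hypothetical names, a sentence is assumed to be any run of text ending in ".", and the kept sentences are re-joined with a single space:

```java
import java.util.ArrayList;
import java.util.List;

class SentenceFilter {

    // Returns the line with every sentence that contains `word` removed.
    // Assumption: a sentence is any run of text ending in '.'.
    public static String removeSentencesContaining(String line, String word) {
        List<String> kept = new ArrayList<>();
        int start = 0;
        while (start < line.length()) {
            int dot = line.indexOf('.', start);
            // Text after the last '.' is treated as a trailing fragment.
            int end = (dot == -1) ? line.length() : dot + 1;
            String sentence = line.substring(start, end).trim();
            if (!sentence.isEmpty() && !sentence.contains(word)) {
                kept.add(sentence);
            }
            start = end;
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String line = "This is first sentence. This is second snake. This is last.";
        // Prints: This is first sentence. This is last.
        System.out.println(removeSentencesContaining(line, "snake"));
    }
}
```

Inside map() the idea would be to drop the textinputformat.record.delimiter setting, so that value is a whole line, and write new Text(SentenceFilter.removeSentencesContaining(value.toString(), "snake")) with NullWritable.get() as before. Like the contains check in the question, this also matches longer words such as "snakes", and a line made up only of "snake" sentences comes out as an empty line.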

Of course, if you can find a way to disable the newline written after each map() call, that would be the simplest solution.

Hope this helps.
