2015-10-02

I am trying to extend the classic wordcount program using Hadoop MapReduce. What I need to build is an indexed word-count application that counts the occurrences of each word in each file of a given input file set; the file set lives in an Amazon S3 bucket. It should also report the total number of occurrences of each word. I have attached the code that counts word occurrences across the whole file set. On top of that, I need to print, for each word, which file it appears in and how many times it appears in that particular file.

I know it is a bit involved, but any help would be greatly appreciated.

Map.java

import java.io.IOException; 
import java.util.*; 

import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileSplit; 

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Only count lowercase tokens that start with a letter and contain letters/digits
    private String pattern = "^[a-z][a-z0-9]*$";

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Name of the file this split came from (fetched here, but not yet part of the output key)
        InputSplit inputSplit = context.getInputSplit();
        String fileName = ((FileSplit) inputSplit).getPath().getName();
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            String stringWord = word.toString().toLowerCase();
            if (stringWord.matches(pattern)) {
                context.write(new Text(stringWord), one);
            }
        }
    }
}

Reduce.java

import java.io.IOException; 

import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 

    public void reduce(Text key, Iterable<IntWritable> values, Context context) 
    throws IOException, InterruptedException { 
     int sum = 0; 
     for (IntWritable val : values) { 
      sum += val.get(); 
     } 
     context.write(key, new IntWritable(sum)); 
    } 
} 

WordCount.java

import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 

public class WordCount { 
    public static void main(String[] args) throws Exception { 
     Configuration conf = new Configuration(); 

     Job job = new Job(conf, "WordCount"); 
     job.setJarByClass(WordCount.class); 
     job.setOutputKeyClass(Text.class); 
     job.setOutputValueClass(IntWritable.class); 

     job.setNumReduceTasks(3); 

     job.setMapperClass(Map.class); 
     job.setReducerClass(Reduce.class); 

     job.setInputFormatClass(TextInputFormat.class); 
     job.setOutputFormatClass(TextOutputFormat.class); 

     FileInputFormat.addInputPath(job, new Path(args[0])); 
     FileOutputFormat.setOutputPath(job, new Path(args[1])); 

     job.waitForCompletion(true); 
    } 
} 
Where is your question? – Roman

Sorry, I didn't get that. –

This site is for questions and answers. There is not a single question mark in your post. So what exactly are you asking? – Roman

Answer


In the mapper, create a custom Writable text pair to use as the output key: it holds both the file name and the word, and the value emitted with it is 1.
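The answer never spells out that class, so here is a minimal sketch of what such a key could look like, assuming the name MytextpairWritable used in the snippets below. It has to implement WritableComparable so Hadoop can serialize, sort, and partition it:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class MytextpairWritable implements WritableComparable<MytextpairWritable> {
    private Text fileName = new Text();
    private Text word = new Text();

    // Hadoop requires a no-arg constructor for deserialization
    public MytextpairWritable() {}

    public MytextpairWritable(String fileName, String word) {
        this.fileName.set(fileName);
        this.word.set(word);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        fileName.write(out);
        word.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fileName.readFields(in);
        word.readFields(in);
    }

    // Sort by file name first, then by word, so output is grouped per file
    @Override
    public int compareTo(MytextpairWritable other) {
        int cmp = fileName.compareTo(other.fileName);
        return cmp != 0 ? cmp : word.compareTo(other.word);
    }

    @Override
    public int hashCode() {
        // Used by the default HashPartitioner to route keys to reducers
        return fileName.hashCode() * 163 + word.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MytextpairWritable)) return false;
        MytextpairWritable other = (MytextpairWritable) o;
        return fileName.equals(other.fileName) && word.equals(other.word);
    }

    @Override
    public String toString() {
        // Controls how TextOutputFormat prints the key
        return fileName + "," + word;
    }
}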

Mapper output:

<K,V> ==> <MytextpairWritable, new IntWritable(1)>

You can get the file name in the mapper with the following snippet:

FileSplit fileSplit = (FileSplit)context.getInputSplit(); 
String filename = fileSplit.getPath().getName(); 

Then pass the file name and the word to the constructor of the custom Writable class inside context.write, something like this:

context.write(new MytextpairWritable(filename,word),new IntWritable(1)); 
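Putting the fragments together, the mapper from the question would change to something like this (a sketch, assuming the MytextpairWritable class above):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Map extends Mapper<LongWritable, Text, MytextpairWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private String pattern = "^[a-z][a-z0-9]*$";

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Name of the file this input split belongs to
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken().toLowerCase();
            if (token.matches(pattern)) {
                // Emit (file name, word) -> 1 instead of word -> 1
                context.write(new MytextpairWritable(fileName, token), one);
            }
        }
    }
}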

In the reducer, just sum up the values; that gives you, for each file, how many times a particular word occurs in it. The reducer code would look something like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<MytextpairWritable, IntWritable, MytextpairWritable, IntWritable> {

    public void reduce(MytextpairWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted for this (file name, word) key
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
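Note that the driver (WordCount.java above) also has to register the new key type, otherwise Hadoop will fail with a type mismatch between the declared and actual mapper output. A sketch of the lines to adjust:

// Mapper now emits the custom composite key
job.setMapOutputKeyClass(MytextpairWritable.class);
job.setMapOutputValueClass(IntWritable.class);

// Final output types written by the reducer
job.setOutputKeyClass(MytextpairWritable.class);
job.setOutputValueClass(IntWritable.class);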

Your output would then look like this:

File1,hello,2 
File2,hello,3 
File3,hello,1