Hadoop: reducing data from mapper to combiner

I have a text input file in which each line contains a URL followed by a variable number of keywords. It looks like this:

facebook.com social news friends
msn.com news mail
yahoo.com finance news

I need this to be transformed into output like:

social facebook.com
news facebook.com msn.com yahoo.com
friends facebook.com
finance yahoo.com
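The transformation above is an inverted index: each keyword becomes a key and collects the URLs it appeared with. As a rough illustration outside Hadoop, the same grouping can be sketched in plain Java. The class name `InvertedIndexSketch` and the method `invert` are hypothetical, and the sketch assumes the URL is always the first token on a line, which is a simplification of the `.com` check used in the mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Hadoop) of the keyword -> URLs inversion.
// Assumption: the first whitespace-separated token of each line is the URL.
class InvertedIndexSketch {

    static Map<String, List<String>> invert(List<String> lines) {
        // LinkedHashMap keeps keywords in first-seen order for readable output.
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens[0].isEmpty()) {
                continue; // skip blank lines
            }
            String url = tokens[0];
            // Every remaining token is a keyword that points back to this URL.
            for (int i = 1; i < tokens.length; i++) {
                index.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(url);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "facebook.com social news friends",
            "msn.com news mail",
            "yahoo.com finance news");
        invert(lines).forEach((keyword, urls) ->
            System.out.println(keyword + " " + String.join(" ", urls)));
    }
}
```

In the MapReduce version, the inner loop corresponds to the mapper emitting (keyword, URL) pairs, and the per-keyword list corresponds to the values the reducer receives for one key.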

My mapper class looks like this:

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeywordsMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text urlkey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] line = value.toString().split(" ");
        ArrayList<String> keywords = new ArrayList<String>();
        for (String sequence : line) {
            if (sequence.endsWith(".com")) {
                // url
                urlkey.set(sequence);
            } else {
                // keyword
                keywords.add(sequence);
            }
        }
        // emit one (keyword, url) pair per keyword on the line
        for (String keyword : keywords) {
            context.write(new Text(keyword), urlkey);
        }
    }
}

My reducer/combiner class looks like this:

public class KeywordReducer extends Reducer<Text, Iterable<Text>, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String body = "";
        for (Text part : values) {
            body = body + " " + part.toString() + " ";
        }
        context.write(key, new Text(body));
    }
}

The job looks like this:

public class KeywordJob extends Configured implements Tool {

    @Override
    public int run(String[] arg0) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        job.setJobName(getClass().getSimpleName());

        FileInputFormat.addInputPath(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));

        job.setMapperClass(KeywordsMapper.class);
        job.setCombinerClass(KeywordReducer.class);
        job.setReducerClass(KeywordReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int rc = ToolRunner.run(new KeywordJob(), args);
        System.exit(rc);
    }

}

The output I currently get is:

With the input file:

yahoo.com news sports finance email celebrity 
amazon.com shoes books jeans 
google.com news finance email search 
microsoft.com operating-system productivity search 
target.com shoes books jeans groceries 
wegmans.com books groceries 
facebook.com news social sports 
linkedin.com news recruitment

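Question: how do I need to adjust my combiner/reducer to get the desired output? Is there a specific reason why the output contains multiple duplicate keys, and why they are not merged?

Source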

2016-02-12 Mark Stroeven

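Could you be more specific? I mean, what exactly is the problem? The output does not even appear after "The output I currently get is:". – Anirudh

If you want to know what the problem is, please read the first part of my question. I am running this on Hadoop 2.3 as a pseudo-distributed system.

Rephrasing my comment: the output generated by the code is not visible in the post. – Anirudh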

Mark,

The reducer is not being called/invoked.

The reducer class should be declared as:

public class KeywordReducer extends Reducer<Text, Text, Text, Text>

instead of

public class KeywordReducer extends Reducer<Text, Iterable<Text>, Text, Text>

because the reducer's input types should match the map output types. The reduce() method signature is correct.

Hope this helps.
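One way to see why the original declaration misbehaves without any compiler error: with the value type parameter set to `Iterable<Text>`, the inherited method expects `Iterable<Iterable<Text>>`, so the question's `reduce(Text, Iterable<Text>, Context)` is a new overload rather than an override. The framework keeps calling the default `reduce`, which writes every value through unchanged, hence the repeated keys. The plain-Java sketch below reproduces the effect with a simplified stand-in; `MiniReducer`, `WrongReducer`, `RightReducer` and `drive` are hypothetical names for illustration, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a reducer base class, written only to illustrate
// the overload-vs-override issue. This is NOT org.apache.hadoop.mapreduce.Reducer.
class OverrideSketch {

    static class MiniReducer<KEYIN, VALUEIN> {
        // Default behaviour: emit each value unchanged (identity).
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, List<String> out) {
            for (VALUEIN value : values) {
                out.add(key + "\t" + value);
            }
        }

        // The "framework" entry point always dispatches to the method declared above.
        final void run(KEYIN key, Iterable<VALUEIN> values, List<String> out) {
            reduce(key, values, out);
        }
    }

    // Wrong: VALUEIN is Iterable<String>, so the inherited method takes
    // Iterable<Iterable<String>>. The method below only overloads it.
    static class WrongReducer extends MiniReducer<String, Iterable<String>> {
        public void reduce(String key, Iterable<String> values, List<String> out) {
            out.add(key + "\t" + String.join(" ", values));
        }
    }

    // Right: VALUEIN is String, so this method overrides the inherited one.
    static class RightReducer extends MiniReducer<String, String> {
        @Override
        public void reduce(String key, Iterable<String> values, List<String> out) {
            out.add(key + "\t" + String.join(" ", values));
        }
    }

    // Raw types mimic a framework that only sees erased types at run time.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static List<String> drive(MiniReducer reducer, String key, List<String> values) {
        List<String> out = new ArrayList<>();
        reducer.run(key, values, out);
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("facebook.com", "msn.com");
        System.out.println(drive(new WrongReducer(), "news", urls)); // one entry per URL
        System.out.println(drive(new RightReducer(), "news", urls)); // one merged entry
    }
}
```

Adding `@Override` to the method in `WrongReducer` would turn the silent mismatch into a compile-time error, which is a cheap safeguard for the real `KeywordReducer` as well.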

Source

2016-02-12 10:39:11 Anirudh

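This is the correct answer. Thank you very much for your help! Is it a general rule that the map output signature should be the same as the reducer's input?

Yes, it is, @MarkStroeven. Aren't you using an IDE? Didn't you get any error messages? That would have helped you spot the error earlier.

Yes, Eclipse, but the Hadoop build runs as a pseudo-distributed system. That means I have to build and export the jar with Maven and run it through hadoop from my command-line terminal. I had no success finding a suitable Eclipse plugin.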
