How to read and process a file as key-value pairs in Hadoop

I am trying to read the following data as key-value pairs in Hadoop.

name: "Clooney, George", release: "2013", movie: "Gravity", 
name: "Pitt, Brad", release: "2004", movie: "Ocean's 12", 
name: "Clooney, George", release: "2004", movie: "Ocean's 12",
name: "Pitt, Brad", release: "1999", movie: "Fight Club"

The output I need is as follows:

name: "Clooney, George", movie: "Gravity, Ocean's 12", 
name: "Pitt, Brad", movie: "Ocean's 12, Fight Club",

I wrote a mapper and a reducer as follows:

public static class MyMapper
        extends Mapper<Text, Text, Text, Text> {

    private Text word = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString(), ",");
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(key, word);
        }
    }
}

public static class MyReducer
        extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String actors = "";
        for (Text val : values) {
            actors += val.toString();
        }
        result.set(actors);
        context.write(key, result);
    }
}

I also added the following configuration:

Configuration conf = new Configuration(); 
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

I got the following output:

name: "Clooney George" release: "2013" movie: "Gravity" George" release: "2004" movie: "Ocean's 12" 
name: "Pitt Brad" release: "2004" movie: "Ocean's 12" Brad" release: "1999" movie: "Fight Club"

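It seems I cannot even get the basic key-value reading right. How does key-value processing work in Hadoop? Can someone explain this in detail and point out where I am going wrong?

Thanks. TM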

2013-11-01 visakh

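Your problem is that KeyValueTextInputFormat does not honor the quotation marks in your input records. It simply looks for the first occurrence of the separator you defined (the comma), takes everything before that character as the key, and takes everything after it as the value.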

So for the first record, your mapper is being fed the following input key and value:

Key: name: "Clooney
Value: George", release: "2013", movie: "Gravity",
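
To make that splitting rule concrete, here is a minimal sketch in plain Java. It is not Hadoop's own code: the class `FirstSeparatorSplit` and its `split` method are hypothetical names used only for illustration, and the handling of a line with no separator (whole line as key, empty value) is my understanding of the record reader's behavior rather than something confirmed in this thread.

```java
// Illustrative stand-in for the splitting rule described above; NOT Hadoop's own code.
class FirstSeparatorSplit {

    // Splits a line at the FIRST occurrence of the separator only.
    // Assumption: if the separator is absent, the whole line is the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String line = "name: \"Clooney, George\", release: \"2013\", movie: \"Gravity\",";
        String[] kv = split(line, ',');
        // The comma inside the quoted name is treated as the separator,
        // so the key is cut off in the middle of the actor's name.
        System.out.println("Key:   " + kv[0]);
        System.out.println("Value: " + kv[1]);
    }
}
```

Because the quotes are ignored, every Clooney record ends up with the same truncated key, which is why the reducer output in the question groups on `name: "Clooney` rather than on the full name.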

To fix this, I think you should switch back to plain TextInputFormat and move the extraction logic into the mapper's map method.
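
As a rough sketch of what that extraction logic could look like, the helper below parses one line into its quoted fields with a regular expression. `RecordParser`, `parse`, and the regex are my own illustrative choices, not part of Hadoop, and the sketch assumes every value is wrapped in double quotes with no escaped quotes inside.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the kind of parsing a map() method could do on each input line.
class RecordParser {

    // Matches pairs such as  name: "Clooney, George"  and captures the field name
    // and the text between the double quotes (commas inside the quotes are kept).
    private static final Pattern FIELD = Pattern.compile("(\\w+):\\s*\"([^\"]*)\"");

    // Returns the fields of one input line, in the order they appear.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> f =
                parse("name: \"Clooney, George\", release: \"2013\", movie: \"Gravity\",");
        // Inside a mapper, the name would be emitted as the key and the movie as the value.
        System.out.println(f.get("name") + " -> " + f.get("movie"));
    }
}
```

With TextInputFormat the mapper receives a LongWritable byte offset as the key and the whole line as the Text value, so the mapper would be declared as `Mapper<LongWritable, Text, Text, Text>` and would emit the parsed name and movie itself. The reducer would then also need to put ", " between the movies, since the one in the question concatenates them without any separator.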

2013-11-01 11:47:05

Thanks for your reply. But doesn't TextInputFormat read the input line by line? Can it be used to process the individual records within a line? – visakh

One more question: how can I display the Mapper's output? I want to see how the Mapper processes the records. Can I do that with standard Java commands? – visakh

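@user295338 To see the mapper output, either disable the reducer or use [MultipleOutputs](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html). With MultipleOutputs you write the mapper data to a separate file and still send it on to the reducer. This lets you keep the mapper's data without disabling the reducer. –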
