How can I generate multiple output file names at runtime in Hadoop?

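For example: K1, K2, data1, data2, data3

Here my mapper passes the key K1K2 and the values data1, data2, data3 to the reducer.

I want to save this data in multiple files, where each file is named K1K2 (or whatever key the reducer receives). If I use the MultipleOutputs class, I have to specify the file names before the mapper starts. But here I can only determine the key after reading the data coming from the mapper. How should I proceed?

PS: I am new to this.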

2014-02-11 Sanchit

You can generate the file name and pass it to MultipleOutputs in the reducer, like this:

private MultipleOutputs<Text, Text> out;

public void setup(Context context) {
    out = new MultipleOutputs<Text, Text>(context);
    ...
}

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    for (Text t : values) {
        // generateFileName is your own function; its result is used as the base output path
        out.write(key, t, generateFileName(<parameter list...>));
    }
}

protected void cleanup(Context context) throws IOException, InterruptedException {
    // close MultipleOutputs so that every file it opened is flushed
    out.close();
}

For details, read the MultipleOutputs class reference: https://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
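The answer leaves generateFileName to the reader. A minimal sketch of what such a helper might look like is given here in plain Java, so it can be tried without a cluster. The class name FileNames, the String parameter, and the sanitising rule are all assumptions made for illustration, not part of the answer. The second method only illustrates how the final part-file name is usually formed for reducer output (base path, then -r-, then a five-digit partition number).

```java
import java.util.Locale;

public class FileNames {

    // Hypothetical helper: turns a reducer key into a base output path.
    // Every character outside [A-Za-z0-9._-] is replaced with '_', so a key
    // containing '/' does not silently create a subdirectory.
    public static String generateFileName(String key) {
        String safe = key.trim().replaceAll("[^A-Za-z0-9._-]", "_");
        // Fall back to a fixed name (an assumption) when nothing usable is left.
        return safe.isEmpty() ? "empty_key" : safe;
    }

    // Illustration only: the part-file name expected for a base path written
    // by the reducer task with the given partition number.
    public static String expectedPartFile(String base, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", base, partition);
    }

    public static void main(String[] args) {
        System.out.println(generateFileName("K1K2"));                       // K1K2
        System.out.println(expectedPartFile(generateFileName("K1 K2"), 0)); // K1_K2-r-00000
    }
}
```

In the reducer this would be called as out.write(key, t, FileNames.generateFileName(key.toString())). Keys that differ only in replaced characters would end up in the same file, so a real job may need a stricter mapping.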

2014-02-11 13:43:50

No, it gives an error: java.lang.IllegalArgumentException: Named output 'K1K2' not defined at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.checkNamedOutputName(MultipleOutputs.java:193) – Sanchit

If I add MultipleOutputs.addNamedOutput(job, FileName1.toString(), TextOutputFormat.class, NullWritable.class, Text.class); in the generateOutput() method, how do I get hold of the job in the reducer? I have only just started, so this may be a very basic question. – Sanchit

There is no need for named outputs. Take a look at my post. –

-1

There is no need to predefine the output file names. You can use MultipleOutputs like this:

public class YourReducer extends Reducer<Text, Value, Text, Value> {
    private Value result = null;
    private MultipleOutputs<Text, Value> out;

    @Override
    public void setup(Context context) {
        out = new MultipleOutputs<Text, Value>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Value> values, Context context)
            throws IOException, InterruptedException {
        // do your code (compute result from values)
        // the third argument is the base output path, relative to the job's output directory
        out.write(key, result, "outputpath/" + key.toString());
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}

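This produces output under the following paths (Hadoop appends its usual part suffix, such as -r-00000, to each name):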

outputpath/K1 
      /K2 
      /K3 
.......

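For this you should use LazyOutputFormat.setOutputFormatClass() instead of FileOutputFormat. You also need to add job.setOutputFormatClass(NullOutputFormat.class) to the job configuration. But do not forget to set the input and output paths as before, with FileInputFormat.setInputPaths() and FileOutputFormat.setOutputPath(). The generated files will then be relative to the specified output path.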
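The driver-side configuration described above might be sketched as follows. The class name KeyedOutputDriver, the job name, and the use of args[0] and args[1] as input and output paths are assumptions for illustration. The sketch shows only the LazyOutputFormat call with TextOutputFormat as the underlying format; whether the additional job.setOutputFormatClass(NullOutputFormat.class) line mentioned above is needed has not been verified here, so it is left out. The mapper and reducer classes are left as a comment because they are specific to your job, and the job will not do useful work until they are set.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class KeyedOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "keyed-output");
        job.setJarByClass(KeyedOutputDriver.class);

        // Set your own mapper and reducer here with
        // job.setMapperClass(...) and job.setReducerClass(...).

        // Assumption: the reducer emits Text keys and Text values.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input and output paths are still set the usual way.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wrap the real output format lazily, so that the default
        // part-r-nnnnn files are only created if something is written to them.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this setup, the files written through MultipleOutputs.write(key, value, baseOutputPath) should appear under the directory given to FileOutputFormat.setOutputPath().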

2014-02-12 09:02:22

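...and you must define the MultipleOutputs in the 'driver' that runs the job. Correct? – OhadR

Do you mean defining the multiple outputs in the driver? –

The file that runs the job must call MultipleOutputs.addNamedOutput(job, ..., TextOutputFormat.class, ...) – OhadR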
