Getting rid of the default output directory completely - MapReduce

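I have code that writes multiple outputs using org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.

The Reducer writes its results to pre-created locations, so I don't need the default o/p directory (which contains the _history and _SUCCESS directories).

I have to delete them every time before I run my job again.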

So I removed the line TextOutputFormat.setOutputPath(job1, new Path(outputPath));. But this gives me an (expected) error.

Driver class:

MultipleOutputs.addNamedOutput(job1, "path1", TextOutputFormat.class, Text.class, LongWritable.class);
MultipleOutputs.addNamedOutput(job1, "path2", TextOutputFormat.class, Text.class, LongWritable.class);
LazyOutputFormat.setOutputFormatClass(job1, TextOutputFormat.class);

Reducer class:

if (condition1)
    mos.write("path1", key, new LongWritable(value), path_list[0]);
else
    mos.write("path2", key, new LongWritable(value), path_list[1]);

Is there a workaround to avoid specifying a default output directory?

Source

2013-09-24 Suvarna Pattayil

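Which version of Hadoop are you running?

As a quick workaround, you can set the output location programmatically and call FileSystem.delete to remove it once the job has completed.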
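A minimal sketch of that workaround follows. It is a JDK-only stand-in so that it runs without Hadoop on the classpath: in a real driver the job body would be job1.waitForCompletion(true) and the cleanup would be FileSystem.delete(outputPath, true), where the second argument requests a recursive delete. The class ThrowawayOutputDir and all of its methods are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: models "run the job, then drop the default output directory".
class ThrowawayOutputDir {

    // Recursively deletes dir; returns true if it existed and was removed.
    static boolean deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return false;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so that children are removed before their parents.
            List<Path> paths = walk.sorted(Comparator.reverseOrder())
                                   .collect(Collectors.toList());
            for (Path p : paths) {
                Files.delete(p);
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the job body, then removes the throwaway directory even if the job failed.
    static void runThenCleanup(Runnable job, Path defaultOutputDir) {
        try {
            job.run();
        } finally {
            deleteRecursively(defaultOutputDir);
        }
    }

    // Creates an empty file, standing in for the marker a finished job leaves behind.
    static void touch(Path file) {
        try {
            Files.createFile(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs a fake job in a temp directory and reports whether the directory survived.
    static boolean outputDirSurvivesJob() {
        try {
            Path out = Files.createTempDirectory("default-out");
            runThenCleanup(() -> touch(out.resolve("_SUCCESS")), out);
            return Files.exists(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default output dir still exists: " + outputDirSurvivesJob());
    }
}
```

Putting the delete in a finally block is a design choice: the throwaway directory is removed whether the job succeeds or fails, so the next run does not trip over a leftover directory.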

Source

2013-09-24 19:22:40 joews

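I'm using CDH4. That is what I'm doing at the moment. I just wanted to know whether there is a way to configure it so that it doesn't write them at all.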

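I don't think _SUCCESS is a directory; it is a file. The history directory lives inside the _logs directory.

First of all, TextOutputFormat.setOutputPath(job1, new Path(outputPath)); is important because, while the job runs, Hadoop uses this path as a working directory in which it creates temporary files for the different tasks (the _temporary directory). The _temporary directory and its files are deleted when the job ends. The _SUCCESS file and the history directory are what remain under the working directory after the job completes successfully. The _SUCCESS file is a kind of flag indicating that the job actually ran successfully. Please have a look at this link.

The _SUCCESS file is created by the TextOutputFormat class that you are using, which in turn uses the FileOutputCommitter class. FileOutputCommitter defines functions such as these: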

public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";

/**
 * Delete the temporary directory, including all of the work directories.
 * This is called for all jobs whose final run state is SUCCEEDED
 * @param context the job's context.
 */
public void commitJob(JobContext context) throws IOException {
    // delete the _temporary folder
    cleanupJob(context);
    // check if the o/p dir should be marked
    if (shouldMarkOutputDir(context.getConfiguration())) {
        // create a _success file in the o/p folder
        markOutputDirSuccessful(context);
    }
}

// Mark the output dir of the job for which the context is passed.
private void markOutputDirSuccessful(JobContext context)
        throws IOException {
    if (outputPath != null) {
        FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
        if (fileSys.exists(outputPath)) {
            // create a file in the folder to mark it
            Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
            fileSys.create(filePath).close();
        }
    }
}

Since markOutputDirSuccessful() is private, you have to override commitJob() instead in order to bypass the creation of SUCCEEDED_FILE_NAME and get what you want.
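To make the control flow concrete, here is a JDK-only model of the commitJob() logic quoted above, together with a subclass that overrides commitJob() to skip the marker. It is a sketch, not Hadoop code: SketchCommitter, NoMarkerCommitter and CommitDemo are hypothetical stand-ins, and the boolean field stands in for shouldMarkOutputDir(). In a real job the subclass would extend FileOutputCommitter and would also have to be returned from a custom OutputFormat's getOutputCommitter(), which is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for FileOutputCommitter: same control flow as the quoted code.
class SketchCommitter {
    static final String SUCCEEDED_FILE_NAME = "_SUCCESS";

    protected final Path outputPath;
    private final boolean markOutputDir; // stand-in for shouldMarkOutputDir(conf)

    SketchCommitter(Path outputPath, boolean markOutputDir) {
        this.outputPath = outputPath;
        this.markOutputDir = markOutputDir;
    }

    void commitJob() {
        cleanupJob();
        if (markOutputDir) {
            markOutputDirSuccessful();
        }
    }

    // Removes the (empty, in this model) _temporary directory if it is present.
    protected void cleanupJob() {
        try {
            Files.deleteIfExists(outputPath.resolve("_temporary"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Private, as in the quoted code, so subclasses cannot override it directly.
    private void markOutputDirSuccessful() {
        try {
            if (Files.exists(outputPath)) {
                Files.createFile(outputPath.resolve(SUCCEEDED_FILE_NAME));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Overrides commitJob() so that cleanup still happens but _SUCCESS is never written.
class NoMarkerCommitter extends SketchCommitter {
    NoMarkerCommitter(Path outputPath) {
        super(outputPath, true);
    }

    @Override
    void commitJob() {
        cleanupJob();
    }
}

class CommitDemo {
    // Commits into a fresh temp directory and reports whether _SUCCESS was created.
    static boolean writesSuccessMarker(boolean useOverride) {
        try {
            Path out = Files.createTempDirectory("job-out");
            Files.createDirectory(out.resolve("_temporary"));
            SketchCommitter committer = useOverride
                    ? new NoMarkerCommitter(out)
                    : new SketchCommitter(out, true);
            committer.commitJob();
            return Files.exists(out.resolve(SketchCommitter.SUCCEEDED_FILE_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default committer writes _SUCCESS: " + writesSuccessMarker(false));
        System.out.println("override writes _SUCCESS:          " + writesSuccessMarker(true));
    }
}
```

As far as can be told from the FileOutputCommitter source, shouldMarkOutputDir() reads the boolean property mapreduce.fileoutputcommitter.marksuccessfuljobs, so setting that property to false in the job configuration may be a simpler route than subclassing. Treat this as something to verify against the source of your own distribution.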

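The next directory, _logs, is important if you later want to use the hadoop HistoryViewer to get a report of how the job actually ran.

I believe that when you use the same output directory as the input to another job, the _SUCCESS file and the _logs directory are ignored because of a filter that Hadoop sets up.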
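The filter referred to here appears to be the hidden-file filter in FileInputFormat, which skips any path whose name starts with an underscore or a dot. The sketch below reproduces that rule with JDK types only, so it runs without Hadoop; the class name HiddenFileRule and the sample file names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of the "hidden file" rule applied to job input listings.
class HiddenFileRule {

    // A file is accepted as input unless its name starts with '_' or '.'.
    static boolean accept(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    // Keeps only the names that would be picked up as input by the next job.
    static List<String> visibleInputs(List<String> names) {
        return names.stream()
                    .filter(HiddenFileRule::accept)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "_SUCCESS", "_logs", "part-r-00000", ".part-r-00000.crc", "path1-r-00000");
        // prints [part-r-00000, path1-r-00000]
        System.out.println(visibleInputs(listing));
    }
}
```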

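Also, when you define a named output for MultipleOutputs, you can write into a subdirectory of the output path given to TextOutputFormat.setOutputPath(), and then use that path as the input of the next job you run.

I don't really see how _SUCCESS and _logs would bother you.

Thanks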

Source

2013-09-24 19:53:23

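The question is quite old, but I'm sharing an answer anyway.

This answer fits the scenario in the question well.

Define your OutputFormat to say that you are not expecting any output. You can do it this way: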

// org.apache.hadoop.mapreduce.Job API; the older JobConf API uses setOutputFormat(...)
job.setOutputFormatClass(NullOutputFormat.class);

Or, you can also use LazyOutputFormat:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat; 
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Credits: @charlesmenguy

Source

2016-05-27 03:22:23
