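Passing files to Hadoop using the -files argument

I have a MapReduce program that runs correctly locally.

It uses a file called new-positions.csv in the setup() method of the mapper class to populate an in-memory hash table: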

public void setup(Context context) throws IOException, InterruptedException {
    newPositions = new Hashtable<String, Integer>();
    File file = new File("new-positions.csv");

    Scanner inputStream = new Scanner(file);
    String line = null;
    String firstline = inputStream.nextLine(); // skip the header row
    while (inputStream.hasNext()) {
        line = inputStream.nextLine();
        String[] splitLine = line.split(",");
        Integer id = Integer.valueOf(splitLine[0].trim());
        // String firstname = splitLine[1].trim();
        // String surname = splitLine[2].trim();
        String[] emails = new String[4];
        for (int i = 3; i < 7; i++) {
            emails[i-3] = splitLine[i].trim();
        }
        for (String email : emails) {
            if (!email.equals("")) newPositions.put(email, id);
        }
        // String position = splitLine[7].trim();
    }
    // Close the scanner once, after the last line has been read
    inputStream.close();
}
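For reference, the parsing step in setup() can be pulled out into a small helper that runs without Hadoop, which makes it easier to separate parsing problems from file-distribution problems. This is only a sketch: the names PositionsLoader and load are hypothetical, and it assumes the same layout as the code above (one header row, the id in column 0, up to four e-mail addresses in columns 3 to 6).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper, not part of the original program.
final class PositionsLoader {

    // Maps every non-empty e-mail in columns 3..6 to the id found in column 0.
    static Map<String, Integer> load(Readable source) {
        Map<String, Integer> positions = new HashMap<>();
        // try-with-resources closes the scanner once, after the loop has finished
        try (Scanner in = new Scanner(source)) {
            if (in.hasNextLine()) {
                in.nextLine(); // skip the header row
            }
            while (in.hasNextLine()) {
                String line = in.nextLine();
                if (line.trim().isEmpty()) {
                    continue; // ignore blank lines
                }
                // A limit of -1 keeps trailing empty columns
                String[] cols = line.split(",", -1);
                Integer id = Integer.valueOf(cols[0].trim());
                for (int i = 3; i < 7 && i < cols.length; i++) {
                    String email = cols[i].trim();
                    if (!email.isEmpty()) {
                        positions.put(email, id);
                    }
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed column layout
        String csv = "id,first,last,e1,e2,e3,e4,position\n"
                   + "7,Ann,Lee,ann@example.com,,al@example.com,,Analyst\n";
        System.out.println(load(new StringReader(csv)));
    }
}
```

In the mapper, setup() could then pass a FileReader for new-positions.csv to load(), so the only Hadoop-specific question left is whether the file is present in the task's working directory.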

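The Java program has been exported to an executable JAR. The JAR and full-positions.csv are both saved in the same directory on our local file system.

Then, from inside that directory, we run the following in the terminal (we also tried it with the full path to new-positions.csv):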

hadoop jar MR2.jar Reader2 -files new-positions.csv InputDataset OutputFolder

It runs fine at first, but when it reaches the mapper we get:

Error: java.io.FileNotFoundException: new-positions.csv (No such file or directory)

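The file definitely exists locally, and we are definitely running the command from inside that directory.

We are following the guidance given in Hadoop: The Definitive Guide (4th edition), p. 274 onwards, and cannot see how our program and arguments differ in structure.

Could this be related to the Hadoop configuration? We know there are workarounds, such as copying the file to HDFS and reading it from there, but we need to understand why the "-files" argument is not working as expected.

EDIT: Below is some code from the driver class, which may also be the source of the problem: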

public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    if (args.length != 5) {
        printUsage(this, "");
        return 1;
    }

    Configuration config = getConf();

    FileSystem fs = FileSystem.get(config); 

    Job job = Job.getInstance(config); 
    job.setJarByClass(this.getClass()); 
    FileInputFormat.addInputPath(job, new Path(args[3])); 

    // Delete old output if necessary 
    Path outPath = new Path(args[4]); 
    if (fs.exists(outPath)) 
     fs.delete(outPath, true); 

    FileOutputFormat.setOutputPath(job, new Path(args[4])); 

    job.setInputFormatClass(SequenceFileInputFormat.class); 

    job.setOutputKeyClass(NullWritable.class); 
    job.setOutputValueClass(Text.class); 

    job.setMapOutputKeyClass(EdgeWritable.class); 
    job.setMapOutputValueClass(NullWritable.class); 

    job.setMapperClass(MailReaderMapper.class); 
    job.setReducerClass(MailReaderReducer.class); 

    job.setJar("MR2.jar"); 


    boolean status = job.waitForCompletion(true); 
    return status ? 0 : 1; 
} 

public static void main(String[] args) throws Exception { 
    int exitCode = ToolRunner.run(new Reader2(), args); 
    System.exit(exitCode); 
}
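One thing that may be worth checking in this driver: the length check and the indices args[3] and args[4] imply that run() receives all five tokens from the command line, including -files and the file name. ToolRunner normally strips the generic options it recognises before calling run(), so if -files is still in that array it was probably not treated as a generic option. Below is a minimal, Hadoop-free sketch of such a check; GenericOptionCheck and findUnconsumed are hypothetical names, and the sample array is simply the command from the question split into tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of the original program.
final class GenericOptionCheck {

    // Returns every index at which `option` still appears in the arguments handed to run().
    static List<Integer> findUnconsumed(String[] runArgs, String option) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < runArgs.length; i++) {
            if (option.equals(runArgs[i])) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumption: these are the five tokens run() sees, as suggested by args.length != 5
        String[] seen = {"Reader2", "-files", "new-positions.csv", "InputDataset", "OutputFolder"};
        System.out.println(findUnconsumed(seen, "-files")); // prints [1]
    }
}
```

Calling findUnconsumed(args, "-files") at the top of run() and logging the result would show whether the option reached the driver unparsed. A non-empty result would suggest looking at how the command line is laid out before blaming the cluster configuration.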


2016-04-18 ajrwhite

Let's assume your "new-positions.csv" is in the folder H:/HDP/. Then you need to pass the file as:

file:///H:/HDP/new-positions.csv

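You need to qualify the path with file:/// to indicate that it is a local file system path. You also need to pass the fully qualified path.

This works perfectly for me.

For example, I passed the local file myini.ini as follows: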

yarn jar hadoop-mapreduce-examples-2.4.0.2.1.5.0-2060.jar teragen -files "file:///H:/HDP/hadoop-2.4.0.2.1.5.0-2060/share/hadoop/common/myini.ini" -Dmapreduce.job.maps=10 10737418 /usr/teraout/
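To avoid typing the qualified form by hand, the file: URI for a local path can be generated with the standard java.nio.file API. This is a small sketch, not anything Hadoop provides: LocalFileUri and toFileUri are hypothetical names, and the exact string produced depends on the operating system and the current directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for building the value passed to -files.
final class LocalFileUri {

    // Resolves a possibly relative local path against the current directory
    // and returns it as a file: URI string.
    static String toFileUri(String localPath) {
        Path absolute = Paths.get(localPath).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // Falls back to the file name from the question when no argument is given
        String name = args.length > 0 ? args[0] : "new-positions.csv";
        System.out.println(toFileUri(name));
    }
}
```

The printed value can then be pasted, in double quotes, after -files on the command line.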


2016-04-18 15:54:58

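The new command looks like this: hadoop jar MR2.jar Reader2 -files file:///home/local/xxx360/FinalProject/new-positions.csv InputDataset OutputFolder ... I get the same error when trying to access "new-positions.csv" in the Java program. Could it be something in our Hadoop configuration? – ajrwhite

Give the whole path in double quotes –

Still not working. I wonder whether the problem is in my driver class. I will edit the main question with additional information. – ajrwhite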

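I think Manjunath Ballur gave you a correct answer, but the URI you passed, file:///home/local/xxx360/FinalProject/new-positions.csv, may not be resolvable from the Hadoop worker machines.

The path looks like an absolute path on a machine, but which machine contains home? Add a server to the path, and I think it may work.

Alternatively, if you use the singular -file, it looks like Hadoop will copy the file rather than create a symbolic link as -files does.

See the documentation here.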
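Whether the file ends up copied or symlinked, the mapper opens it by its bare name, so what matters is what that name resolves to in the task's working directory. Here is a minimal sketch of a probe that could be called from setup() and logged; SideFileProbe and describe are hypothetical names, and it uses only java.nio.file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of the original program.
final class SideFileProbe {

    // Reports the absolute location a relative name resolves to and what is found there.
    static String describe(String name) {
        Path path = Paths.get(name).toAbsolutePath();
        String state;
        if (Files.isSymbolicLink(path)) {
            // Files.exists follows the link, so false here means the target is gone
            state = Files.exists(path) ? "symlink" : "broken symlink";
        } else if (Files.exists(path)) {
            state = "regular file or directory";
        } else {
            state = "missing";
        }
        return path + " -> " + state;
    }

    public static void main(String[] args) {
        System.out.println(describe("new-positions.csv"));
    }
}
```

If the probe reports "missing" on the worker, the file was never localized for the task, which points at how the option was passed rather than at the reading code.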


2017-06-24 07:06:09
