Running a Hadoop job from another Java program

I am writing a program that receives the source code of mappers/reducers, compiles them dynamically, and packages them into a JAR file. It then has to run this JAR file on a Hadoop cluster.

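For the last part, I set all the required parameters dynamically in my code. However, the problem I am facing now is that the code needs the compiled mapper and reducer classes at compile time. At compile time I don't have these classes; they are only received later, at run time (for example, through a message from a remote node). I would appreciate any ideas or suggestions on how to get around this problem.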

Below is my code for the last part. The problem is that job.setMapperClass(Mapper_Class.class) and job.setReducerClass(Reducer_Class.class) require the class files (Mapper_Class.class and Reducer_Class.class) to be present at compile time:

private boolean run_Hadoop_Job(String className){ 
try{ 
    System.out.println("Starting to run the code on Hadoop..."); 
    String[] argsTemp = { "project_test/input", "project_test/output" }; 
    // create a configuration 
    Configuration conf = new Configuration(); 
    conf.set("fs.default.name", "hdfs://localhost:54310"); 
    conf.set("mapred.job.tracker", "localhost:54311"); 
    conf.set("mapred.jar", jar_Output_Folder+ java.io.File.separator 
          + className+".jar"); 
    conf.set("mapreduce.map.class", "Mapper_Reducer_Classes$Mapper_Class.class"); 
    conf.set("mapreduce.reduce.class", "Mapper_Reducer_Classes$Reducer_Class.class"); 
    // create a new job based on the configuration 
    Job job = new Job(conf, "Hadoop Example for dynamically and programmatically compiling-running a job"); 
    job.setJarByClass(Platform.class); 
    //job.setMapperClass(Mapper_Class.class); 
    //job.setReducerClass(Reducer_Class.class); 

    // key/value of your reducer output 
    job.setOutputKeyClass(Text.class); 
    job.setOutputValueClass(IntWritable.class); 

    FileInputFormat.addInputPath(job, new Path(argsTemp[0])); 
    // this deletes possible output paths to prevent job failures 
    FileSystem fs = FileSystem.get(conf); 
    Path out = new Path(argsTemp[1]); 
    fs.delete(out, true); 
    // finally set the empty out path 
    FileOutputFormat.setOutputPath(job, new Path(argsTemp[1])); 

    //job.submit(); 
    System.exit(job.waitForCompletion(true) ? 0 : 1); 
    System.out.println("Job Finished!");   
} catch (Exception e) { return false; } 
return true; 
}

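Edit: I have now modified the code to specify the mapper and reducer using conf.set("mapreduce.map.class", "my mapper.class"). The code compiles correctly, but when it is executed it throws the following error: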

Dec 24, 2012 6:49:43 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Task Id : attempt_201212240511_0006_m_000001_2, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: Mapper_Reducer_Classes$Mapper_Class.class
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:569)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

2012-12-23 reza

If you don't have them at compile time, set the class names directly in the configuration, like this:

conf.set("mapreduce.map.class", "org.what.ever.ClassName"); 
conf.set("mapreduce.reduce.class", "org.what.ever.ClassName");
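
One detail worth checking when setting these properties: the stack trace in the question shows `Configuration.getClass` failing with a `ClassNotFoundException` for a name that ends in `.class`. Java resolves classes by their binary name (for example `Outer$Inner` for a nested class), without a `.class` suffix. The sketch below uses only the JDK to illustrate the difference; `ClassNameCheck` and `isLoadable` are hypothetical names, and `java.util.Map$Entry` merely stands in for a nested mapper class.

```java
final class ClassNameCheck {
    /** Returns true if the given binary class name can be resolved by this class's loader. */
    static boolean isLoadable(String binaryName) {
        try {
            // 'false' means: locate the class but do not run its static initializers
            Class.forName(binaryName, false, ClassNameCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Binary name of a nested class: resolves
        System.out.println(isLoadable("java.util.Map$Entry"));
        // Same name with a ".class" suffix: not a valid class name
        System.out.println(isLoadable("java.util.Map$Entry.class"));
    }
}
```

If the same rule applies to the configuration value, `"Mapper_Reducer_Classes$Mapper_Class"` would be the form to try rather than `"Mapper_Reducer_Classes$Mapper_Class.class"`.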

2012-12-23 13:41:24

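You have to add the Hadoop jars to the property named 'tmpjars'. It would work like this: 'conf.set("tmpjars", "/usr/local/hadoop/hadoop-core.jar,/usr/local/hadoop/hadoop-example.jar")'. The jar paths must be separated by commas. Note that this is quite inconvenient: you have to make sure that these jars actually exist on the client machine (so that Hadoop can copy them to HDFS and the task trackers can download them). –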

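Thanks, Thomas. I figured that part out, and my code now compiles correctly. But it throws some errors during execution. I have edited my original post to reflect this. Any ideas? – reza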

Did you explicitly add the jar that contains your mapper to 'tmpjars'? –
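
The 'tmpjars' value described in the comments above is a comma-separated list of jars that must exist on the client. A minimal sketch of building such a value and failing early when a jar is missing might look like this; `TmpJars` and `toProperty` are hypothetical names, the code uses only the JDK, and the actual `conf.set` call is shown as a comment because it needs Hadoop on the classpath.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

final class TmpJars {
    /** Joins jar paths with commas, rejecting any path that is not a file on this machine. */
    static String toProperty(List<String> jarPaths) {
        for (String path : jarPaths) {
            if (!new File(path).isFile()) {
                throw new IllegalArgumentException("Jar not found on client: " + path);
            }
        }
        return String.join(",", jarPaths);
    }

    public static void main(String[] args) {
        // Jar paths are taken from the command line in this sketch
        String value = toProperty(Arrays.asList(args));
        System.out.println("tmpjars=" + value);
        // With Hadoop available, the value would then be passed on:
        // conf.set("tmpjars", value);
    }
}
```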

You only need a reference to the Class object of the class that will be created dynamically. Use Class.forName("foo.Mapper") instead of foo.Mapper.class.
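
As a rough illustration of this suggestion, the sketch below looks a class up by name at run time and checks it against an expected base type before handing it on. `DynamicClassLookup`, `loaderFor`, `lookup`, and the jar path `generated/mappers.jar` are hypothetical, and `java.util.ArrayList` stands in for a mapper so that the code runs with the JDK alone.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;

final class DynamicClassLookup {
    /** Creates a class loader that also searches the given jar. */
    static ClassLoader loaderFor(File jar) throws MalformedURLException {
        return new URLClassLoader(new URL[] { jar.toURI().toURL() },
                DynamicClassLookup.class.getClassLoader());
    }

    /** Resolves a class by name and verifies that it is a subtype of 'base'. */
    static <T> Class<? extends T> lookup(String name, Class<T> base, ClassLoader loader)
            throws ClassNotFoundException {
        return Class.forName(name, true, loader).asSubclass(base);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical location of the jar produced at run time
        ClassLoader loader = loaderFor(new File("generated/mappers.jar"));
        // With Hadoop on the classpath one might write instead:
        // job.setMapperClass(lookup("foo.Mapper", Mapper.class, loader));
        Class<?> c = lookup("java.util.ArrayList", java.util.List.class, loader);
        System.out.println(c.getName());
    }
}
```

This only removes the compile-time dependency in the submitting program; the tasks on the cluster still need to be able to find the jar that contains the class.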

2012-12-23 14:24:01

The problem is that the TaskTracker cannot see the classes that exist only in your local JRE.

I solved it this way (Maven project):

First, add this plugin to pom.xml. It builds your application's jar file, including all the dependency jars:

<build> 
    <plugins> 
     <plugin> 
      <groupId>org.apache.maven.plugins</groupId> 
      <artifactId>maven-shade-plugin</artifactId> 
      <executions> 
       <execution> 
        <phase>package</phase> 
        <goals> 
         <goal>shade</goal> 
        </goals> 
       </execution> 
      </executions> 
      <configuration> 
       <filters> 
        <filter> 
         <artifact>*:*</artifact> 
         <excludes> 
          <exclude>META-INF/*.SF</exclude> 
          <exclude>META-INF/*.DSA</exclude> 
          <exclude>META-INF/*.RSA</exclude> 
         </excludes> 
        </filter> 
       </filters> 
       <finalName>sample</finalName> 
       <!-- 
       <finalName>uber-${artifactId}-${version}</finalName> 
       --> 
      </configuration> 
     </plugin> 
    </plugins> 
    </build>

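Second, add these lines to the Java source code. They reference your sample.jar, which is built to target/sample.jar as set by the <finalName> tag in pom.xml.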

Configuration config = new Configuration();
config.set("fs.default.name", "hdfs://ip:port");
config.set("mapred.job.tracker", "hdfs://ip:port");

JobConf job = new JobConf(config);
// Path to the shaded jar built by Maven; it is submitted to the cluster with the job
job.setJar("target/sample.jar");

This way, your task trackers can reference the classes you wrote, and the ClassNotFoundException does not occur.
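
Before submitting, it can help to confirm that the jar passed to `setJar` really contains the mapper and reducer classes. The following is a small sketch using only the JDK; `JarClassCheck` and `containsClass` are hypothetical names, and the jar path and class name are supplied by the caller.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

final class JarClassCheck {
    /** Returns true if the jar has an entry for the given binary class name. */
    static boolean containsClass(File jar, String binaryName) throws IOException {
        // foo.Mapper -> foo/Mapper.class; nested classes keep their '$'
        String entry = binaryName.replace('.', '/') + ".class";
        try (JarFile jarFile = new JarFile(jar)) {
            return jarFile.getEntry(entry) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: jar path, e.g. target/sample.jar; args[1]: class name to look for
        System.out.println(containsClass(new File(args[0]), args[1]));
    }
}
```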

2013-08-26 02:37:23

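This is the best answer. You probably don't want to use a shaded jar that contains everything the Hadoop job needs and also keep all of that on the classpath of the external Java program. There may be jar conflicts or other issues. Referencing the shaded jar by path keeps it abstracted away from the external program and lets it be sent to the Hadoop cluster through the built-in API. You can build a different jar for the external program that contains only the specific dependencies that program needs. – Galuvian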
