Advice on reading two different datasets into Hadoop at the same time?

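Dear Hadoopers: I'm new to Hadoop and have recently been trying to implement an algorithm.

The algorithm needs to compute a matrix that holds the rating difference between every pair of songs. I have already done this part; the output is a 600000 * 600000 sparse matrix that I store in HDFS. Let's call it dataset A (size = 160G).

Now I need to read the users' profiles to predict their ratings for specific songs. So I first need to read the user profiles (about 5G in size), let's call this dataset B, and then do the calculation using dataset A.

But I don't know how to read two datasets from a single Hadoop program. Or could I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed system and I can't read dataset B into the memory of a single machine.)

Any suggestions?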

2011-04-11 Ke Xie

This might help: http://stackoverflow.com/questions/4593243/hadoop-job-taking-input-files-from-multiple-directories – Nija 2011-04-11 10:14:22

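I would suggest using either Pig or Hive (google them) and implementing this as a join from the user profiles to the song data. I would also look into Mahout, the machine learning system for Hadoop. Implementing joins in Hadoop through its native Java API is really annoying. – 2011-04-11 22:39:26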

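Thx Spike ... Mahout does have an implementation that precomputes the diff matrix for Slope One, but it doesn't provide a complete Hadoop version of the Slope One algorithm. Anyway, I will try to set up Hive. Thanks for your advice – 2011-04-16 11:57:40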

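Hadoop lets you use a different map input format for each input folder. So you can read from several data sources and cast to the specific type inside the Map function, i.e. in one case you get (String, User), in the other (String, SongSongRating), and the Map signature is (String, Object). The second step is to choose your recommendation algorithm and join the data in such a way that the aggregator has enough information to compute the recommendation.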
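As a rough illustration of the tagging idea, here is a minimal plain-Java sketch with no Hadoop dependency. It simulates two map functions that emit the same key/value types by prefixing each value with a source tag, followed by a stand-in for the shuffle step. The line formats (`userId,songId,rating` and `songI,songJ,diff`), the `U#` and `D#` tags, and all class and method names are assumptions made up for this example, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: two inputs, one shared map output type, joined by key.
class TaggedInputsSketch {

    // Map step for the user-profile input (assumed format: userId,songId,rating).
    // Returns {key, value} with the song ID as the join key.
    static String[] mapUserRating(String line) {
        String[] f = line.split(",");
        return new String[] { f[1].trim(), "U#" + f[0].trim() + ":" + f[2].trim() };
    }

    // Map step for the song-difference input (assumed format: songI,songJ,diff).
    // Returns {key, value} with the first song ID as the join key.
    static String[] mapSongDiff(String line) {
        String[] f = line.split(",");
        return new String[] { f[0].trim(), "D#" + f[1].trim() + ":" + f[2].trim() };
    }

    // Stand-in for the shuffle phase: groups values by key, the way the
    // framework would before handing them to the reducer.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(mapUserRating("u1,s1,4.0"));
        pairs.add(mapSongDiff("s1,s2,0.5"));
        pairs.add(mapSongDiff("s3,s1,-1.0"));
        // Each song key now carries tagged values from whichever inputs mentioned it.
        groupByKey(pairs).forEach((song, values) -> System.out.println(song + " -> " + values));
    }
}
```

The aggregator can then tell the two sources apart by the tag prefix, which is what gives it enough information to combine a user's rating with the matching song differences.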

2011-04-11 22:07:16 yura

You can use two Map functions, one per dataset, if each dataset needs different processing. You need to register the mappers with your job. For example:

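    // Old-API (org.apache.hadoop.mapred) imports used by the mappers and driver.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.MultipleInputs;
    import org.apache.hadoop.util.GenericOptionsParser;

    // The classes below are nested inside the driver class FullOuterJoin.

    // Mapper for the first input; each line is "person_name,book_title".
    public static class FullOuterJoinStdDetMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
    {
        // Tag that tells the reducer which input a value came from.
        private static final String FILE_TAG = "person_book#";

        public void map(LongWritable key, Text values,
                        OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException
        {
            // Limit -1 keeps empty fields, so ",title" still yields two fields.
            String[] person_detail = values.toString().split(",", -1);
            if (person_detail.length < 2 || person_detail[1].trim().isEmpty())
            {
                return; // no book title means no join key; skip the record
            }
            String person_name = person_detail[0].trim();
            if (person_name.isEmpty())
            {
                person_name = "student name missing";
            }
            String book_title = person_detail[1].trim();
            // Emit (book_title, tag + person_name) so both inputs share the join key.
            output.collect(new Text(book_title), new Text(FILE_TAG + person_name));
        }
    }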


    // Mapper for the second input; each line is "book_title,author_name".
    public static class FullOuterJoinResultDetMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
    {
        private static final String FILE_TAG = "auth_book#";

        public void map(LongWritable key, Text values,
                        OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException
        {
            String[] author_detail = values.toString().split(",", -1);
            String book_title = author_detail[0].trim();
            if (book_title.isEmpty())
            {
                return; // no book title means no join key; skip the record
            }
            // Fall back to a placeholder when the author field is absent or empty.
            String author_name = author_detail.length > 1 ? author_detail[1].trim() : "";
            if (author_name.isEmpty())
            {
                author_name = "author name missing";
            }
            output.collect(new Text(book_title), new Text(FILE_TAG + author_name));
        }
    }


    public static void main(String args[])
        throws Exception
    {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options (-D, -files, ...) before counting arguments.
        String[] argum = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (argum.length != 3)
        {
            System.err.println("Usage: FullOuterJoin <person_book input> <auth_book input> <output>");
            System.exit(-1);
        }

        conf.set("mapred.textoutputformat.separator", ",");
        // Build the job from conf so the separator setting above takes effect.
        JobConf mrjob = new JobConf(conf, FullOuterJoin.class);
        mrjob.setJobName("Full_Outer_Join");

        // Register one mapper per input path.
        MultipleInputs.addInputPath(mrjob, new Path(argum[0]), TextInputFormat.class, FullOuterJoinStdDetMapper.class);
        MultipleInputs.addInputPath(mrjob, new Path(argum[1]), TextInputFormat.class, FullOuterJoinResultDetMapper.class);

        FileOutputFormat.setOutputPath(mrjob, new Path(argum[2]));
        // FullOuterJoinReducer is not shown in this answer.
        mrjob.setReducerClass(FullOuterJoinReducer.class);

        mrjob.setOutputKeyClass(Text.class);
        mrjob.setOutputValueClass(Text.class);

        JobClient.runJob(mrjob);
    }
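The driver references `FullOuterJoinReducer`, which is not shown here. As a guess at what such a reducer might do with the values for one book title, here is a minimal plain-Java sketch of the join logic, kept free of Hadoop types so it can be run on its own. The class name, the `join` method, the `NULL` placeholder, and the output row layout are assumptions for illustration; only the two tag strings come from the mappers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the per-key logic a full outer join reducer could use.
class FullOuterJoinSketch {
    static final String PERSON_TAG = "person_book#";
    static final String AUTHOR_TAG = "auth_book#";

    // Joins all tagged values that arrived for one book title.
    static List<String> join(String bookTitle, List<String> taggedValues) {
        List<String> persons = new ArrayList<>();
        List<String> authors = new ArrayList<>();
        // Split the values back into their source inputs using the tag prefix.
        for (String v : taggedValues) {
            if (v.startsWith(PERSON_TAG)) {
                persons.add(v.substring(PERSON_TAG.length()));
            } else if (v.startsWith(AUTHOR_TAG)) {
                authors.add(v.substring(AUTHOR_TAG.length()));
            }
        }
        // Full outer join: a side with no records contributes one NULL placeholder.
        if (persons.isEmpty()) persons.add("NULL");
        if (authors.isEmpty()) authors.add("NULL");
        // Emit every person/author combination for this title.
        List<String> rows = new ArrayList<>();
        for (String p : persons) {
            for (String a : authors) {
                rows.add(bookTitle + "," + p + "," + a);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(join("Dune", Arrays.asList("person_book#alice", "auth_book#Herbert")));
        System.out.println(join("Emma", Arrays.asList("person_book#bob")));
    }
}
```

In a real reducer the same loop would run over the `Iterator<Text>` of values and write each row through the `OutputCollector` instead of returning a list.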

2014-03-21 05:31:36 Tanveer
