2015-05-06 56 views

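I want to run a Weka classifier in MapReduce. Loading even a 200 MB ARFF file in full causes a heap space error, so I want to split the ARFF file into chunks. Each chunk must keep the header information, that is, the attribute declarations, so that the classifier can run in every mapper. Below is the code I tried for splitting the data, but it does not do the job efficiently. How can I split the input ARFF file into smaller chunks in order to process very large datasets?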

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {
        Path path = file.getPath();
        FileSystem fs = path.getFileSystem(job.getConfiguration());

        // number of bytes in this file
        long length = file.getLen();
        BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);

        // make sure this is actually a valid (non-empty) file
        if (length != 0) {
            // number of splits to make; defaults to 1 if "Run-num.splits" is not set
            int count = job.getConfiguration().getInt("Run-num.splits", 1);
            for (int t = 0; t < count; t++) {
                // add one split per iteration; each split covers the whole
                // file (offset 0, full length), not a distinct chunk
                splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
            }
        } else {
            // zero-length file: add an empty split with no host hints
            splits.add(new FileSplit(path, 0, length, new String[0]));
        }
    }
    return splits;

Answer


Have you tried this first?

In mapred-site.xml, add this property:

<property> 
    <name>mapred.child.java.opts</name> 
    <value>-Xmx2048m</value> 
</property> 

// this sets the maximum heap available to each MapReduce task JVM
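If a larger heap is not enough, the header-preserving chunking asked about in the question could look roughly like the sketch below. It is only an illustration in plain Java, not a Weka or Hadoop API: the class name `ArffChunker`, the method `chunk`, and the `rowsPerChunk` parameter are hypothetical. It assumes the ARFF file has `@data` on a line of its own and one instance per line. Everything up to and including `@data` is treated as the header and is copied to the top of every chunk, so each chunk is a complete ARFF document on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ArffChunker {

    /**
     * Streams an ARFF file and hands it to the sink in chunks of at most
     * rowsPerChunk data rows. Each chunk starts with a copy of the header
     * (all lines up to and including "@data"). Hypothetical helper.
     */
    public static void chunk(BufferedReader in, int rowsPerChunk,
                             Consumer<List<String>> sink) {
        if (rowsPerChunk <= 0) {
            throw new IllegalArgumentException("rowsPerChunk must be positive");
        }
        List<String> header = new ArrayList<>();
        List<String> current = null;   // chunk being filled, or null if none is open
        int rows = 0;                  // data rows in the current chunk
        boolean inData = false;        // true once the "@data" line has been read

        try {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (!inData) {
                    // still in the header: keep every line, watch for @data
                    header.add(line);
                    inData = trimmed.regionMatches(true, 0, "@data", 0, 5);
                    continue;
                }
                // skip blank lines and ARFF comments inside the data section
                if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                    continue;
                }
                if (current == null) {
                    current = new ArrayList<>(header); // new chunk starts with the header
                }
                current.add(line);
                if (++rows == rowsPerChunk) {
                    sink.accept(current);
                    current = null;
                    rows = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current != null) {
            sink.accept(current); // last, partially filled chunk
        }
    }
}
```

Only one chunk is held in memory at a time, so `rowsPerChunk` bounds the heap usage. The sink could, for example, write each chunk to its own file, and those files could then be used as the job input so that each mapper receives a self-contained ARFF file.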


I'll try this as well. – Amogh
