Splitting an input ARFF file into smaller chunks to process a very large dataset

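I want to run a Weka classifier on MapReduce, but loading an entire ARFF file of even 200 MB causes a heap space error. So I want to split the ARFF file into chunks, but each chunk must keep the header information, i.e. the attribute declarations, so that the classifier can run in every mapper. Here is the code I tried for splitting the data, but it does not do the job efficiently: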

List<InputSplit> splits = new ArrayList<InputSplit>();
for (FileStatus file : listStatus(job)) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());

    // number of bytes in this file
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);

    // make sure this is actually a valid file
    if (length != 0) {
        // set the number of splits to make. NOTE: the value can be changed to anything
        int count = job.getConfiguration().getInt("Run-num.splits", 1);
        for (int t = 0; t < count; t++) {
            // add one split per iteration; as written, every split covers
            // bytes 0..length, i.e. the whole file, not a distinct chunk
            splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
        }
    } else {
        // create a split with no hosts for zero-length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
    }
}
return splits;
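
The snippet above passes offset 0 and the full length to every FileSplit, so each mapper would still receive the whole file. Raw byte ranges would not carry the ARFF header anyway. One possible workaround, sketched here under assumptions rather than offered as a definitive fix, is to pre-split the file into standalone ARFF chunks before submitting the job. The class name `ArffChunker`, the `chunk-N.arff` file naming, and the `rowsPerChunk` parameter are all hypothetical; the sketch assumes one instance per line after `@data` and uses only the Java standard library.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: streams an ARFF file and writes chunk files that
// each repeat the full header (@relation, @attribute lines, @data).
public class ArffChunker {

    public static List<Path> split(BufferedReader in, int rowsPerChunk, Path outDir)
            throws IOException {
        if (rowsPerChunk <= 0) {
            throw new IllegalArgumentException("rowsPerChunk must be positive");
        }

        // Collect everything up to and including the @data line.
        List<String> header = new ArrayList<>();
        boolean sawData = false;
        String line;
        while ((line = in.readLine()) != null) {
            header.add(line);
            if (line.trim().equalsIgnoreCase("@data")) {
                sawData = true;
                break;
            }
        }
        if (!sawData) {
            throw new IOException("no @data line found");
        }

        List<Path> chunks = new ArrayList<>();
        BufferedWriter out = null;
        int rows = 0;
        try {
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                // Skip blank lines and ARFF comments in the data section.
                if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                    continue;
                }
                // Start a new chunk file when none is open or the current one is full.
                if (out == null || rows == rowsPerChunk) {
                    if (out != null) {
                        out.close();
                    }
                    Path chunk = outDir.resolve("chunk-" + chunks.size() + ".arff");
                    out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                    for (String h : header) {
                        out.write(h);
                        out.newLine();
                    }
                    chunks.add(chunk);
                    rows = 0;
                }
                out.write(line);
                out.newLine();
                rows++;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return chunks;
    }
}
```

Only the header and one line at a time are held in memory, so the input size should not matter much. Each resulting file is a complete ARFF document that a mapper could load on its own, for example by using the chunk files as the job's input paths.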

2015-05-06 Amogh

Have you tried this first?

In mapred-site.xml, add this property:

<property> 
    <name>mapred.child.java.opts</name> 
    <value>-Xmx2048m</value> 
</property>

// memory allocation for the MR job
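
To check whether the `-Xmx2048m` value actually reaches the task JVM, one option is to log the maximum heap the JVM reports. The following is a minimal sketch; the class name `HeapCheck` is hypothetical, and in a real job the same call would go into the mapper's setup code rather than a `main` method.

```java
// Hypothetical helper: reports the maximum heap available to the current JVM.
public class HeapCheck {

    // Maximum heap in MiB, as reported by the running JVM.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```

The reported figure is typically somewhat below the `-Xmx` value, so it is better read as a rough confirmation than as an exact match.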

2015-05-15 09:20:06

I'll try this as well. – Amogh
