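Split an input ARFF file into smaller chunks to process very large datasets

I want to run a Weka classifier in MapReduce, but loading even a 200 MB ARFF file in full causes a heap space error. So I want to split the ARFF file into chunks. Each chunk has to keep the header information, i.e. the ARFF attribute declarations, so that a classifier can run in each mapper. Here is the code I am using to try to split the data, but I have not been able to do it efficiently: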
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {
        Path path = file.getPath();
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        // number of bytes in this file
        long length = file.getLen();
        BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
        // make sure this is actually a valid file
        if (length != 0) {
            // set the number of splits to make. NOTE: the value can be changed to anything
            int count = job.getConfiguration().getInt("Run-num.splits", 1);
            for (int t = 0; t < count; t++) {
                // split the file and add each chunk to the list
                splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
            }
        } else {
            // create an empty array for zero-length files
            splits.add(new FileSplit(path, 0, length, new String[0]));
        }
    }
    return splits;
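To make the requirement concrete, below is a minimal sketch in plain Java (no Hadoop or Weka dependencies) of chunking that repeats the ARFF header in every chunk. The class name `ArffChunker`, the method `split`, and the parameter `rowsPerChunk` are hypothetical names chosen for illustration. The sketch assumes the header ends at the `@data` line and that each instance occupies exactly one line. It also takes the lines as an in-memory list to stay short; for a file of 200 MB the same logic would need to stream from a reader instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Weka or Hadoop.
class ArffChunker {

    /**
     * Splits ARFF lines into chunks of at most rowsPerChunk data rows.
     * Every chunk starts with a copy of the full header (all lines up to
     * and including @data), so each chunk is a valid ARFF document.
     */
    static List<List<String>> split(List<String> lines, int rowsPerChunk) {
        if (rowsPerChunk <= 0) {
            throw new IllegalArgumentException("rowsPerChunk must be positive");
        }

        // Collect the header: everything up to and including the @data line.
        List<String> header = new ArrayList<>();
        int i = 0;
        boolean foundData = false;
        for (; i < lines.size(); i++) {
            String line = lines.get(i);
            header.add(line);
            if (line.trim().toLowerCase(Locale.ROOT).startsWith("@data")) {
                foundData = true;
                i++; // move past the @data line
                break;
            }
        }
        if (!foundData) {
            throw new IllegalArgumentException("no @data section found");
        }

        // Distribute the data rows, starting each chunk with the header.
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = null;
        int rows = 0;
        for (; i < lines.size(); i++) {
            String line = lines.get(i);
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                continue; // skip blank lines and ARFF comments in the data section
            }
            if (current == null) {
                current = new ArrayList<>(header);
                rows = 0;
            }
            current.add(line);
            rows++;
            if (rows == rowsPerChunk) {
                chunks.add(current);
                current = null;
            }
        }
        if (current != null) {
            chunks.add(current); // last, possibly shorter, chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "@relation demo",
                "@attribute x numeric",
                "@attribute y {a,b}",
                "@data",
                "1,a", "2,b", "3,a");
        for (List<String> chunk : split(lines, 2)) {
            System.out.println(String.join("\n", chunk));
            System.out.println("----");
        }
    }
}
```

In a Hadoop job the same idea would probably live in a custom record reader or in a preprocessing step that writes one header-prefixed file per chunk, since a byte-range `FileSplit` on its own does not carry the attribute declarations.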
I'll give this a try as well. – Amogh