I am using the Java API to convert some custom files into Hadoop sequence files. Why is writing a Hadoop SequenceFile so much slower than reading it back?
I read byte arrays from a local file and append them to a SequenceFile as pairs of index (IntWritable) – data (BytesWritable):
InputStream in = new BufferedInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/" + "data.seq");

IntWritable key = new IntWritable();
BytesWritable value = new BytesWritable();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());

for (int i = 1; i <= nz; i++) {
    byte[] imageData = new byte[nx * ny * 2];
    in.read(imageData);
    key.set(i);
    value.set(imageData, 0, imageData.length);
    writer.append(key, value);
}
IOUtils.closeStream(writer);
in.close();
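One detail worth noting in the loop above: a plain `InputStream.read(buf)` is not guaranteed to fill the buffer; it may return after fewer bytes, leaving the tail of `imageData` as zeros. A minimal sketch of the safer pattern, using only the JDK (`ByteArrayInputStream` stands in for the local file here, which is an illustrative assumption):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ReadFullyDemo {
    public static void main(String[] args) throws IOException {
        byte[] source = new byte[1024 * 1024 * 2]; // 2 MB, like one slice above

        // InputStream.read(buf) may return fewer bytes than buf.length;
        // DataInputStream.readFully keeps reading until the buffer is full
        // and throws EOFException if the stream ends early.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(source));
        byte[] imageData = new byte[source.length];
        in.readFully(imageData);
        in.close();

        System.out.println(imageData.length == source.length); // prints "true"
    }
}
```

This does not explain the write/read gap by itself, but it removes one source of silently short reads from the conversion loop.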
To convert the file back to its initial format, I do exactly the inverse:
// Reader construction (omitted in the original snippet); assumes the
// same fs, conf and sequenceFilePath as the writer above.
SequenceFile.Reader reader = new SequenceFile.Reader(fs, sequenceFilePath, conf);
for (int i = 1; i <= nz; i++) {
    reader.next(key, value);
    int byteLength = value.getLength();
    byte[] tempValue = value.getBytes();
    out.write(tempValue, 0, byteLength);
    out.flush();
}
reader.close();
I noticed that writing to the SequenceFile takes almost an order of magnitude longer than reading it back. I expect writing to be slower than reading, but is a difference this large normal, and why?
Further information: the byte arrays I read are 2 MB each (nx = ny = 1024 and nz = 128).
I am testing in pseudo-distributed mode.
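One way to narrow down where the time goes is to measure the same 2 MB-slice loop against the local filesystem with plain JDK streams, which takes HDFS and the SequenceFile format out of the picture entirely. A minimal baseline sketch (slice count and temp-file name are illustrative, not from the question):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class LocalBaseline {
    public static void main(String[] args) throws IOException {
        final int sliceSize = 1024 * 1024 * 2; // 2 MB, as in the question
        final int slices = 16;                 // fewer than nz = 128, to keep the run short
        byte[] slice = new byte[sliceSize];
        File tmp = File.createTempFile("baseline", ".dat");

        // Time the writes.
        long t0 = System.nanoTime();
        OutputStream out = new BufferedOutputStream(new FileOutputStream(tmp));
        for (int i = 0; i < slices; i++) {
            out.write(slice);
        }
        out.close();
        long writeMs = (System.nanoTime() - t0) / 1_000_000;

        // Time the reads of the same data.
        t0 = System.nanoTime();
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(tmp)));
        for (int i = 0; i < slices; i++) {
            in.readFully(slice);
        }
        in.close();
        long readMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("write ms: " + writeMs + ", read ms: " + readMs);
        tmp.delete();
    }
}
```

If the local baseline shows a much smaller write/read gap than the SequenceFile run, the difference is likely coming from the HDFS write path (checksumming and the DataNode pipeline, which writes touch even in pseudo-distributed mode) rather than from the conversion loop itself.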
An "order of magnitude" in what unit of time? – 2012-03-04 16:19:30
"More than ten times." – fgrollio 2012-03-06 08:06:37