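Why is writing a Hadoop SequenceFile so much slower than reading it?

I am using the Java API to convert some custom files into Hadoop sequence files.

I read byte arrays from a local file and append them to a sequence file as pairs of index (IntWritable) and data (BytesWritable wrapping a byte[]):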

InputStream in = new BufferedInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/" + "data.seq");

IntWritable key = new IntWritable();
BytesWritable value = new BytesWritable();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());

for (int i = 1; i <= nz; i++) {
    byte[] imageData = new byte[nx * ny * 2];
    in.read(imageData);

    key.set(i);
    value.set(imageData, 0, imageData.length);
    writer.append(key, value);
}
IOUtils.closeStream(writer);
in.close();

I do exactly the reverse to restore the file to its initial format:

for (int i = 1; i <= nz; i++) {
    reader.next(key, value);
    int byteLength = value.getLength();
    byte[] tempValue = value.getBytes();
    out.write(tempValue, 0, byteLength);
    out.flush();
}

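I noticed that writing to the SequenceFile takes almost an order of magnitude longer than reading from it. I expected writing to be slower than reading, but is a difference this large normal? Why?

More info: the byte arrays I read are 2 MB each (nx = ny = 1024 and nz = 128).
I am testing in pseudo-distributed mode.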

2012-03-02 fgrollio

In terms of time, what do you mean by "order of magnitude"? – 2012-03-04 16:19:30

"More than ten times." – fgrollio 2012-03-06 08:06:37

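You are reading from the local disk and writing to HDFS. When you write to HDFS, your data is probably being replicated, so it is physically written two or three times, depending on what you have set the replication factor to.

So you are not just writing, you are writing two to three times the amount of data. Your writes also go over the network. Your reads do not.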
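As a rough sanity check on that argument, the physical volume written is the logical payload multiplied by the replication factor. The sketch below uses the sizes given in the question (nx = ny = 1024, nz = 128, 2 bytes per sample). The class and method names are made up for illustration, and it only does arithmetic: it does not query HDFS and it ignores SequenceFile headers and sync markers.

```java
// Hypothetical helper: estimates bytes physically written for a given replication factor.
final class ReplicationCost {
    // Logical payload: nz slices of nx * ny samples, bytesPerSample bytes each.
    static long logicalBytes(int nx, int ny, int nz, int bytesPerSample) {
        return (long) nx * ny * bytesPerSample * nz;
    }

    // Each block is stored once per replica, so the payload is multiplied by the factor.
    static long physicalBytes(long logicalBytes, int replicationFactor) {
        return logicalBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long logical = logicalBytes(1024, 1024, 128, 2);
        for (int replication = 1; replication <= 3; replication++) {
            long mib = physicalBytes(logical, replication) / (1024 * 1024);
            System.out.println("replication=" + replication + " -> " + mib + " MiB");
        }
    }
}
```

With a replication factor of 1, as is typical for a pseudo-distributed setup, the payload is 256 MiB, so replication alone would not account for a tenfold gap in that case.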

2012-03-02 14:29:59

I am testing in pseudo-distributed mode, so I have no replication and no network traffic. Sorry for not pointing that out. – fgrollio 2012-03-02 15:05:53

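Are nx and ny constants?

One reason you might be seeing this is that every iteration of your for loop creates a new byte array. That requires the JVM to allocate heap space for you. If the array is large enough, this becomes expensive, and eventually you will run into GC. I am not too sure what HotSpot can do to optimize this away, though.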
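One way to check whether the per-iteration allocation actually matters on a given JVM is to time both variants with Hadoop out of the picture. This is a minimal sketch using only java.io: `BufferReuseCheck` and `readRecords` are hypothetical names, the input is an in-memory array rather than the original file, and a single timed run like this (no warm-up, no benchmarking framework) is only indicative.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class BufferReuseCheck {
    // Reads `count` records of `recordSize` bytes each. Allocates a fresh array per
    // record unless `reuse` is true. Returns a simple checksum of the first and last
    // byte of every record so the reads cannot be optimized away.
    static long readRecords(InputStream source, int recordSize, int count, boolean reuse) {
        DataInputStream in = new DataInputStream(source);
        byte[] shared = reuse ? new byte[recordSize] : null;
        long checksum = 0;
        try {
            for (int i = 0; i < count; i++) {
                byte[] buf = reuse ? shared : new byte[recordSize];
                in.readFully(buf, 0, recordSize);
                checksum += buf[0] + buf[recordSize - 1];
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return checksum;
    }

    public static void main(String[] args) {
        int recordSize = 1024 * 1024 * 2; // nx * ny * 2 from the question
        int count = 32;                   // assumption: fewer than nz = 128 to keep the heap small
        byte[] data = new byte[recordSize * count];
        for (boolean reuse : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long checksum = readRecords(new ByteArrayInputStream(data), recordSize, count, reuse);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("reuse=" + reuse + " checksum=" + checksum + " time=" + micros + " us");
        }
    }
}
```

If the two timings are close, allocation is unlikely to be the main cause and the time is probably being spent in the write path itself.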

My suggestion is to create a single BytesWritable and reuse it:

// use DataInputStream so you can call readFully()
DataInputStream in = new DataInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/" + "data.seq");

IntWritable key = new IntWritable();
// create a BytesWritable, which can hold the maximum possible number of bytes
BytesWritable value = new BytesWritable(new byte[maxPossibleSize]);
// grab a reference to the value's underlying byte array
byte[] byteBuf = value.getBytes();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());

for (int i = 1; i <= nz; i++) {
    // work out how many bytes to read - if this is a constant, move outside the for loop
    int imageDataSize = nx * ny * 2;
    // read in bytes to the byte array
    in.readFully(byteBuf, 0, imageDataSize);

    key.set(i);
    // set the actual number of bytes used in the BytesWritable object
    value.setSize(imageDataSize);
    writer.append(key, value);
}

IOUtils.closeStream(writer);
in.close();
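The switch to `readFully()` in the snippet above also matters for correctness: `InputStream.read(byte[])` is allowed to return fewer bytes than the array holds, so a record can end up only partly filled. Here is a small sketch of the difference, using a hypothetical `TrickleStream` that hands out at most three bytes per call to simulate a short read (all names in it are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ShortReadDemo {
    // Hypothetical stream that returns at most 3 bytes per read call.
    static final class TrickleStream extends FilterInputStream {
        TrickleStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    // Single read() call: returns how many bytes were actually placed in target.
    static int plainRead(byte[] source, byte[] target) {
        try (InputStream in = new TrickleStream(new ByteArrayInputStream(source))) {
            return in.read(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // readFully() keeps reading until target is full, or throws EOFException.
    static void fullRead(byte[] source, byte[] target) {
        try (DataInputStream in =
                new DataInputStream(new TrickleStream(new ByteArrayInputStream(source)))) {
            in.readFully(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] source = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println("read() filled " + plainRead(source, new byte[10]) + " of 10 bytes");
        byte[] full = new byte[10];
        fullRead(source, full);
        System.out.println("readFully() last byte = " + full[9]);
    }
}
```

Checking the return value of `read()` in a loop would work as well; `readFully()` simply packages that loop.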

2012-03-21 00:48:56

Yes, nx and nz are constants. I will try this, thanks for your detailed answer. – fgrollio 2012-03-27 14:20:43

fgrollio, did this help improve the performance? – 2013-02-26 15:10:49
