Is Hadoop SequenceFile binary safe?

I read SequenceFile.java in the hadoop-1.0.4 source code and found the sync(long) method. It is used to locate a "sync marker" (a 16-byte MD5 generated when the file is created) when a SequenceFile is divided into file splits for MapReduce.

/** Seek to the next sync mark past a given position. */
public synchronized void sync(long position) throws IOException {
    if (position+SYNC_SIZE >= end) {
        seek(end);
        return;
    }

    try {
        seek(position+4);                       // skip escape
        in.readFully(syncCheck);
        int syncLen = sync.length;
        for (int i = 0; in.getPos() < end; i++) {
            int j = 0;
            for (; j < syncLen; j++) {
                if (sync[j] != syncCheck[(i+j)%syncLen])
                    break;
            }
            if (j == syncLen) {
                in.seek(in.getPos() - SYNC_SIZE);   // position before sync
                return;
            }
            syncCheck[i%syncLen] = in.readByte();
        }
    } catch (ChecksumException e) {             // checksum failure
        handleChecksumException(e);
    }
}

This code simply scans for a sequence of bytes identical to the sync marker.
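To make the scan concrete, here is a minimal sketch of the same circular-buffer comparison run over an in-memory byte array instead of a stream. The class name SyncScanDemo and the method findMarker are hypothetical and not part of Hadoop; the sketch only mirrors the comparison logic of the loop above, and unlike the original it also checks the final window at the end of the data.

```java
import java.util.Arrays;

class SyncScanDemo {
    /**
     * Returns the offset of the first occurrence of marker in data at or
     * after 'from', or -1 if there is none. Hypothetical helper for illustration.
     */
    static int findMarker(byte[] data, byte[] marker, int from) {
        int len = marker.length;
        if (from < 0 || from + len > data.length) {
            return -1;
        }
        // Fill the window once, like in.readFully(syncCheck) in the original.
        byte[] window = Arrays.copyOfRange(data, from, from + len);
        int pos = from + len;                    // index of the next unread byte
        for (int i = 0; ; i++) {
            int j = 0;
            for (; j < len; j++) {
                // The window is a ring buffer whose logical start is i % len.
                if (marker[j] != window[(i + j) % len]) {
                    break;
                }
            }
            if (j == len) {
                return pos - len;                // offset where the match begins
            }
            if (pos >= data.length) {
                return -1;                       // ran out of data
            }
            window[i % len] = data[pos++];       // overwrite the oldest byte
        }
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16];
        for (int k = 0; k < marker.length; k++) {
            marker[k] = (byte) (k * 7 + 3);      // arbitrary fixed 16-byte pattern
        }
        byte[] data = new byte[64];              // zero-filled stand-in for record bytes
        System.arraycopy(marker, 0, data, 21, marker.length);
        System.out.println(findMarker(data, marker, 0));
    }
}
```

The scan reports the first 16-byte match it encounters and has no way to tell whether those bytes were written as a marker or are part of a record.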

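My question:
Consider the case where the data in a SequenceFile happens to contain a 16-byte sequence identical to the sync marker. Wouldn't the code above mistake those 16 bytes for a sync marker, so that the SequenceFile is no longer parsed correctly?

I did not find any "escape" operation applied to the data or to the sync marker. How can SequenceFile be binary safe? Am I missing something?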

2013-04-27 Shawn H

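Collisions are technically possible, but in practice they are extremely unlikely.

From http://search-hadoop.com/m/VYVra2krg5t1:

The probability of a given random (uniformly distributed) 16-byte string appearing in a petabyte of data is about 10^-23. It is more likely that your data center is wiped out by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).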
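The order of magnitude can be checked with a simple union bound: each byte offset is a candidate start for a match, and a uniformly random 16-byte window equals one fixed marker with probability 2^-128. The sketch below assumes 1 PB = 10^15 bytes and uniformly random data; the class and method names are hypothetical. Under those assumptions the bound comes out at a few times 10^-24, which agrees with the quoted figure to within an order of magnitude.

```java
class SyncCollisionEstimate {
    /**
     * Union-bound estimate of the chance that a fixed marker of markerBytes
     * bytes appears somewhere in dataBytes bytes of uniformly random data:
     * (number of candidate offsets) / 2^(8 * markerBytes).
     */
    static double collisionBound(double dataBytes, int markerBytes) {
        return dataBytes / Math.pow(2.0, 8.0 * markerBytes);
    }

    public static void main(String[] args) {
        double onePetabyte = 1e15;               // assumption: decimal petabyte
        System.out.printf("%.2e%n", collisionBound(onePetabyte, 16));
    }
}
```

Real record data is not uniformly random, so this is a rough guide rather than a guarantee.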

2013-10-17 15:39:33 rodo
