What information describes the quantitative difference between two given files of the same size?

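Usually, to find out how two binary files differ, I use the diff and hexdump tools. But in some situations, given two large binary files of the same size, I would like to see only their quantitative differences, such as the number of differing regions and the cumulative difference. What information describes the quantitative difference between two given files of the same size?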

Example: two files, A and B. They have 2 differing regions, and their cumulative difference is 6c-a3 + 6c-11 + 6f-6e + 20-22.

File A = 48 65 6c 6c 6f 2c 20 57
File B = 48 65 a3 11 6e 2c 22 57
               |------|   |--|
                reg 1     reg 2

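How can I get this information using standard GNU tools and Bash, or would I be better off with a simple Python script? Other statistics about how the 2 files differ might also be useful, but I don't know what else could be measured, or how. Entropy difference? Variance difference?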

Source

2011-10-20 psihodelia

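What's wrong with the 'cmp' program you already have on Linux? http://en.wikipedia.org/wiki/Cmp_(Unix). Also, what should the result be if the files differ in size (or the compared regions differ in size)?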

Your definition of "cumulative difference" implies that the cumulative difference between '00 ff' and 'ff 00' is 0. Is that intended?
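The cancellation this comment points out is easy to check. A minimal sketch, with the two byte pairs hard-coded; the helper names `signed_sum` and `absolute_sum` are made up for illustration and are not from the thread:

```python
def signed_sum(a: bytes, b: bytes) -> int:
    """Sum of the signed per-byte differences a[i] - b[i]."""
    return sum(x - y for x, y in zip(a, b))


def absolute_sum(a: bytes, b: bytes) -> int:
    """Sum of the absolute per-byte differences |a[i] - b[i]|."""
    return sum(abs(x - y) for x, y in zip(a, b))


first = bytes.fromhex("00ff")
second = bytes.fromhex("ff00")

print(signed_sum(first, second))    # (0 - 255) + (255 - 0) = 0
print(absolute_sum(first, second))  # 255 + 255 = 510
```

If positive and negative differences should not cancel each other out, summing the absolute values is probably the measure to prefer.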

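As a more general remark: asking for other "difference measures" without specifying a purpose is meaningless. You can treat the two files as vectors (of bytes) and use any finite-dimensional vector norm.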
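To make the vector view concrete, here is a hedged sketch that computes a few common norms of the byte-wise difference with numpy. The function name `difference_norms` and the choice of norms are assumptions for illustration; the sample bytes are the ones from the question.

```python
import numpy as np


def difference_norms(a: np.ndarray, b: np.ndarray) -> dict:
    """Return a few norms of the difference vector a - b."""
    # Widen to float64 first so that uint8 subtraction cannot wrap around.
    d = a.astype(np.float64) - b.astype(np.float64)
    return {
        # Number of differing positions (Hamming distance; not a true norm).
        "l0": int(np.count_nonzero(d)),
        # Sum of absolute differences.
        "l1": float(np.linalg.norm(d, 1)),
        # Euclidean distance.
        "l2": float(np.linalg.norm(d, 2)),
        # Largest single-byte difference.
        "linf": float(np.linalg.norm(d, np.inf)),
    }


a = np.frombuffer(bytes.fromhex("48656c6c6f2c2057"), dtype=np.uint8)
b = np.frombuffer(bytes.fromhex("4865a3116e2c2257"), dtype=np.uint8)
norms = difference_norms(a, b)
print(norms)
```

Which norm is appropriate depends on the purpose: the L1 norm weights every byte change equally, while the L-infinity norm only reports the worst single byte.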

For anything other than the regions, you can use numpy. Something like this (untested):

import numpy as np

a = np.fromfile("file A", dtype=np.uint8)
b = np.fromfile("file B", dtype=np.uint8)

# Compute the number of bytes that are different
different_bytes = np.sum(a != b)

# Widen to a signed type before subtracting: a - b on uint8 arrays
# wraps around modulo 256 instead of producing negative values
delta = a.astype(np.int16) - b.astype(np.int16)

# Compute the sum of the differences
difference = np.sum(delta, dtype=np.int64)

# Compute the sum of the absolute value of the differences
absolute_difference = np.sum(np.abs(delta), dtype=np.int64)

# In some cases, the number of bits that have changed is a better
# measurement of change. To compute it we make a lookup array where
# bitcount_lookup[byte] == number_of_1_bits_in_byte (so
# bitcount_lookup[0:16] == [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4])
bitcount_lookup = np.array(
    [bin(i).count("1") for i in range(256)], dtype=np.uint8)

# Numpy allows using an array as an index. ^ computes the XOR of
# each pair of bytes. The result is a byte with a 1 bit where the
# bits of the input differed, and a 0 bit otherwise.
bit_diff_count = np.sum(bitcount_lookup[a ^ b], dtype=np.int64)

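I couldn't find a numpy function for counting the regions, but you can write your own using a != b as input; it shouldn't be hard. See this question for inspiration.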
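One possible way to count the regions from the `a != b` mask, as a sketch rather than a tested solution: pad the mask with `False` on both sides and look for the positions where its value flips. The function name `diff_regions` is hypothetical; the sample bytes are the ones from the question.

```python
import numpy as np


def diff_regions(a: np.ndarray, b: np.ndarray) -> list:
    """Return (start, stop) index pairs, stop exclusive, of each run of differing bytes."""
    mask = a != b
    # Padding with False guarantees every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the padded mask changes value; they alternate start, stop.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[0::2], edges[1::2]
    return list(zip(starts.tolist(), stops.tolist()))


a = np.frombuffer(bytes.fromhex("48656c6c6f2c2057"), dtype=np.uint8)
b = np.frombuffer(bytes.fromhex("4865a3116e2c2257"), dtype=np.uint8)

regions = diff_regions(a, b)
print(regions)       # [(2, 5), (6, 7)]
print(len(regions))  # 2
```

The number of regions is then just the length of the returned list, and per-region statistics can be computed by slicing `a` and `b` with each pair.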

Source

2011-10-20 12:16:53

Thanks, it looks like an interesting solution. – psihodelia

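One approach that comes to mind is to adapt a binary diff algorithm, e.g. a python implementation of the rsync algorithm. Starting from that, it should be relatively easy to get a list of the block ranges where the files differ, and then compute whatever statistics you want on those blocks.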
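Since the two files have the same size, a much simpler stand-in for the rsync algorithm may already be enough: compare fixed-size blocks at the same offsets and merge adjacent differing blocks into ranges. The sketch below is not rsync (there are no rolling checksums and no detection of shifted data); the function name `differing_block_ranges` and the block size are assumptions for illustration.

```python
import io
from typing import BinaryIO, List, Tuple


def differing_block_ranges(
    fa: BinaryIO, fb: BinaryIO, block_size: int = 4096
) -> List[Tuple[int, int]]:
    """Return (first_block, last_block + 1) ranges of blocks that differ."""
    ranges: List[Tuple[int, int]] = []
    index = 0
    while True:
        block_a = fa.read(block_size)
        block_b = fb.read(block_size)
        if not block_a and not block_b:
            break  # both files exhausted
        if block_a != block_b:
            if ranges and ranges[-1][1] == index:
                # Extend the previous range when this block is adjacent to it.
                ranges[-1] = (ranges[-1][0], index + 1)
            else:
                ranges.append((index, index + 1))
        index += 1
    return ranges


# In-memory stand-ins for the two files from the question.
file_a = io.BytesIO(bytes.fromhex("48656c6c6f2c2057"))
file_b = io.BytesIO(bytes.fromhex("4865a3116e2c2257"))
print(differing_block_ranges(file_a, file_b, block_size=2))  # [(1, 4)]
```

For real files, the same function can be called with two handles opened in binary mode, e.g. `open(path, "rb")`. Because it reads block by block, it avoids loading large files into memory all at once.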

Source

2011-10-20 13:20:01 janneb
