Hashing files with Python

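I am writing a Python script that should find all files in the cwd that have identical contents. My idea was to use a hash function, but when I run the script every file gets a different digest, even though all the files are copies of one another. This does not happen when I compute the digests in the terminal. I can't figure out where the problem is. The code is below: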

import sys
import os
import hashlib
from collections import defaultdict

blocksize = 65536

def hashfile(file, hasher):
    buf = file.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = file.read(blocksize)
    #print(hasher.hexdigest())
    return hasher.hexdigest()

def main():
    dir = os.getcwd()
    files = os.listdir(dir)
    dict = defaultdict(list)
    l = []
    hasher = hashlib.sha256()

    for file in files:
        hash = hashfile(open(file, 'rb'), hasher)
        l.append((hash, file))

    for k, v in l:
        dict[k].append(v)

    for k in dict.items():
        print(k)


if __name__ == '__main__':
    main()

Source

2015-05-27 Mirko Mucaria

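You are using a single hasher for all of the files, and it is updated cumulatively. By the time you process the second file, the digest you get covers the contents of the first file followed by the second, not the second file alone.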

    #hasher = hashlib.sha256()

    for file in files:
        hasher = hashlib.sha256()
        hash = hashfile(open(file, 'rb'), hasher)
        l.append((hash, file))

Move the hasher = hashlib.sha256() line into the for loop.
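To see the accumulation concretely, here is a minimal sketch that feeds two byte strings into one shared hasher. The variable names and sample data are made up for illustration; only hashlib.sha256, update and hexdigest come from the question.

```python
import hashlib

# Stand-ins for the contents of two files (illustrative data only).
data_a = b"first file contents"
data_b = b"second file contents"

# One shared hasher, as in the question's main().
shared = hashlib.sha256()
shared.update(data_a)
digest_after_first = shared.hexdigest()   # digest of data_a
shared.update(data_b)
digest_after_second = shared.hexdigest()  # digest of data_a + data_b

# The second digest covers both inputs, not just data_b.
combined_matches = digest_after_second == hashlib.sha256(data_a + data_b).hexdigest()
second_alone_matches = digest_after_second == hashlib.sha256(data_b).hexdigest()
print(combined_matches, second_alone_matches)  # prints: True False
```

This is why identical files end up with different digests: each one is hashed on top of everything that was read before it.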

I think it would be better to move hasher = hashlib.sha256() into the hashfile function:

def hashfile(file): 
    hasher = hashlib.sha256() 
    buf = file.read(blocksize) 
    #original code here

That makes the code clearer.
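Putting the suggestion together, one possible complete version might look like the sketch below. It goes a little beyond the answers above, so treat these as assumptions: hashfile takes a path instead of an open file object (so the file can be closed with a with block), the find_duplicates helper is a hypothetical name, and subdirectories are skipped because open() would fail on them.

```python
import hashlib
import os
from collections import defaultdict

BLOCKSIZE = 65536

def hashfile(path, blocksize=BLOCKSIZE):
    """Return the SHA-256 hex digest of the file at `path`."""
    hasher = hashlib.sha256()  # fresh hasher for every file
    with open(path, 'rb') as f:
        buf = f.read(blocksize)
        while buf:
            hasher.update(buf)
            buf = f.read(blocksize)
    return hasher.hexdigest()

def find_duplicates(directory):
    """Map each digest to the file names sharing it (hypothetical helper)."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # assumption: ignore subdirectories
            groups[hashfile(path)].append(name)
    # Keep only digests shared by more than one file.
    return {digest: names for digest, names in groups.items() if len(names) > 1}
```

Calling find_duplicates(os.getcwd()) and printing the items of the result should list each group of identical files once, keyed by their common digest.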

Source

2015-05-27 09:59:01 WKPlus
