2012-09-24

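Finding duplicate files with Python

I'm trying to write a Python script that crawls a directory, finds all duplicate files, and reports the duplicates. What is the best way to do this?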

import os, sys 

def crawlDirectories(directoryToCrawl): 
    crawledDirectory = [os.path.join(path, subname) for path, dirnames, filenames in os.walk(directoryToCrawl) for subname in dirnames + filenames] 
    return crawledDirectory 

#print 'Files crawled',crawlDirectories(sys.argv[1]) 

directoriesWithSize = {} 
def getByteSize(crawledDirectory):
    for eachFile in crawledDirectory:
        size = os.path.getsize(eachFile)
        directoriesWithSize[eachFile] = size
    return directoriesWithSize

getByteSize(crawlDirectories(sys.argv[1])) 

#print directoriesWithSize.values() 

duplicateItems = {} 

def compareSizes(dictionaryDirectoryWithSizes):
    for key, value in dictionaryDirectoryWithSizes.items():
        if directoriesWithSize.values().count(value) > 1:
            duplicateItems[key] = value

compareSizes(directoriesWithSize) 

#print directoriesWithSize.values().count(27085) 

compareSizes(directoriesWithSize) 

print duplicateItems 

Why does this throw the following error?

Traceback (most recent call last):
  File "main.py", line 16, in <module>
    getByteSize(crawlDirectories(sys.argv[1]))
  File "main.py", line 12, in getByteSize
    size = os.path.getsize(eachFile)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py", line 49, in getsize
OSError: [Errno 2] No such file or directory: '../Library/Containers/com.apple.ImageKit.RecentPictureService/Data/Documents/iChats'

Run it with: python filename.py foldername


It seems to be related to symbolic links. Is there any way to avoid crawling them? – Matthew
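
One possible way to avoid the error, assuming the failing path really is a dangling symbolic link as the comment above suspects, is to filter links out with `os.path.islink` before asking for sizes. The sketch below targets Python 3, and the name `crawl_regular_files` is made up for illustration.

```python
import os

def crawl_regular_files(root):
    """Return paths of files under root, skipping symbolic links."""
    found = []
    # os.walk does not descend into symlinked directories by default
    # (followlinks=False), but links still show up in the name lists.
    for path, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(path, name)
            # A symlink may point at a target that no longer exists,
            # which is what makes os.path.getsize raise OSError.
            if os.path.islink(full):
                continue
            found.append(full)
    return found
```

Only `filenames` is scanned here, so directories are never passed to `os.path.getsize`.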

Answers

0

In my opinion, your crawlDirectories function is far more complicated than it needs to be:

def crawlDirectories(directoryToCrawl):
    output = []
    for path, dirnames, filenames in os.walk(directoryToCrawl):
        for fname in filenames:
            output.append(os.path.join(path, fname))
    return output
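
With a files-only list like the one returned above, the size comparison from the question could be done in a single pass by grouping paths under their byte size, rather than calling `.values().count()` once per entry. This is only a sketch for Python 3; `group_by_size` is a hypothetical helper, and files of equal size are merely candidates for duplicates, not proven ones.

```python
import os
from collections import defaultdict

def group_by_size(paths):
    """Map byte size -> list of paths, keeping only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for p in paths:
        try:
            by_size[os.path.getsize(p)].append(p)
        except OSError:
            # Unreadable or vanished entry (e.g. a dangling symlink): skip it.
            continue
    return {size: group for size, group in by_size.items() if len(group) > 1}
```

Something like `group_by_size(crawlDirectories(sys.argv[1]))` would then yield the size groups worth inspecting further.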
0

I suggest trying:

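def crawlDirectories(directoryToCrawl):
    # os.walk yields a list of file names per directory, so iterate over
    # that list rather than passing it to os.path.join directly.
    crawledDirectory = [os.path.realpath(os.path.join(p, f))
                        for (p, dirs, files) in os.walk(directoryToCrawl)
                        for f in files]
    return crawledDirectory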

That is, use canonical paths rather than the relative paths produced by the crawl.
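
To illustrate why canonical paths help: a symlink and its target, or two differently spelled relative paths, can name the same file, and `os.path.realpath` collapses them to a single entry. A rough Python 3 sketch that then confirms duplicates by content hash might look like this; `file_digest`, `find_duplicates`, and the choice of SHA-256 are assumptions for illustration, not something the answer prescribes.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Hash a file's contents in chunks so large files are not read at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Return lists of canonical paths whose contents are identical."""
    # Canonicalise first, so a symlink and its target count as one file.
    canonical = set()
    for path, dirnames, filenames in os.walk(root):
        for name in filenames:
            real = os.path.realpath(os.path.join(path, name))
            if os.path.isfile(real):  # drops dangling links
                canonical.add(real)

    by_digest = defaultdict(list)
    for p in sorted(canonical):
        by_digest[file_digest(p)].append(p)
    return [group for group in by_digest.values() if len(group) > 1]
```

Hashing every file is simple but slow on large trees; comparing sizes first and hashing only files that share a size would cut the work considerably.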