How can I optimize a file search using os.listdir and os.walk?

Python 2.7.5, Windows/Mac.

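I'm trying to find the best way to search for files (about 10,000) across several storage units (about 128 TiB in total). The files have specific extensions, and I can ignore some folders.

Here is my first function, using os.listdir and recursion: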

import os

count = 0

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for i in pathList:
        subPath = path + os.path.sep + i
        if os.path.isfile(subPath) == True:
            fileName = os.path.basename(subPath)
            extension = fileName[fileName.rfind("."):]
            if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                count += 1
                # do stuff . . .
        else:
            if os.path.isdir(subPath) == True:
                if not "UselessFolder1" in subPath and not "UselessFolder2" in subPath:
                    SearchFiles1(subPath)

It works, but I think it could be better (faster and more correct). Or am I wrong?

So I tried os.walk:

def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for i in dirpath:
            if not "UselessFolder1" in i and not "UselessFolder2" in i:
                for y in files:
                    fileName = os.path.basename(y)
                    extension = fileName[fileName.rfind("."):]
                    if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                        count += 1
                        # do stuff . . .
    return count

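The count is wrong, and this version is somewhat slower. I think I don't really understand how os.walk works.

My question is: what can I do to optimize this search?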


2015-11-18 Syrius

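Your first solution is reasonable, except that you could use os.path.splitext. The second solution is incorrect because `for i in dirpath` loops over the characters of the path string, so the files list is processed again and again instead of just once per directory. The trick with os.walk is that directories removed from subdirs (in place, with the default top-down traversal) are not part of the next round of enumeration.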

def SearchFiles2(path):
    useless_dirs = set(("UselessFolder1", "UselessFolder2"))
    useless_files = set((".ext1", ".ext2"))  # extensions to skip in this example
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        # remove unwanted subdirs from future enumeration
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        # list of interesting files
        myfiles = [os.path.join(dirpath, name) for name in files
                   if os.path.splitext(name)[1] not in useless_files]
        count += len(myfiles)
        for filepath in myfiles:
            # example shows file stats
            print(filepath, os.stat(filepath))
    return count
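
As a side note, the pruning step can also be written with slice assignment, which replaces the contents of the same list object that os.walk holds. The following is only a minimal sketch: `count_files`, `build_demo_tree`, and the file names in the throwaway tree are hypothetical and exist purely for illustration.

```python
import os
import shutil
import tempfile

def count_files(root, skip_dirs, wanted_exts):
    """Count files under root whose extension is in wanted_exts,
    never descending into directories named in skip_dirs."""
    count = 0
    for dirpath, subdirs, files in os.walk(root):  # top-down by default
        # Slice assignment edits the list in place, so os.walk
        # never visits the directories filtered out here.
        subdirs[:] = [d for d in subdirs if d not in skip_dirs]
        count += sum(1 for f in files
                     if os.path.splitext(f)[1] in wanted_exts)
    return count

def build_demo_tree():
    """Create a small temporary tree (hypothetical names) for the demo."""
    root = tempfile.mkdtemp()
    for sub in ("keep", "UselessFolder1"):
        os.mkdir(os.path.join(root, sub))
    for rel in ("a.ext1",
                os.path.join("keep", "b.ext2"),
                os.path.join("keep", "c.txt"),
                os.path.join("UselessFolder1", "d.ext1")):
        open(os.path.join(root, rel), "w").close()
    return root

if __name__ == "__main__":
    root = build_demo_tree()
    try:
        # a.ext1 and keep/b.ext2 match; d.ext1 sits in a pruned folder
        print(count_files(root, {"UselessFolder1"}, {".ext1", ".ext2"}))  # 2
    finally:
        shutil.rmtree(root)
```

Rebinding the name instead (`subdirs = [...]`) would have no effect on the traversal, because os.walk keeps its own reference to the original list.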

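Enumerating the file system of a single storage unit can only go so fast. The best way to speed this up is to run the enumeration of the different storage units in different threads.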
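
One way the thread-per-storage-unit idea might look is sketched below. This is an illustration under assumptions, not a tuned implementation: `count_matching` and `count_in_threads` are hypothetical helper names, and whether threads actually help depends on the storage units being independent devices.

```python
import os
import threading

def count_matching(root, wanted_exts):
    """Walk one root and count files with a wanted extension."""
    count = 0
    for dirpath, subdirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1] in wanted_exts:
                count += 1
    return count

def count_in_threads(roots, wanted_exts):
    """Run one walker thread per root; return {root: count}."""
    results = {}

    def worker(root):
        # Each thread writes only its own key, so no lock is needed here.
        results[root] = count_matching(root, wanted_exts)

    threads = [threading.Thread(target=worker, args=(r,)) for r in roots]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

It would be called with one root per storage unit, for example `count_in_threads(["/Volumes/StoreA", "/Volumes/StoreB"], {".ext1", ".ext2"})`, where both mount points are made-up placeholders.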


2015-11-18 16:46:47 tdelaney

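Thanks for your example. I improved the first solution (os.path.splitext and comparing the string against the contents of a tuple). It is a bit faster, and we can easily add more rules (file extensions / ignored subdirectories). – Syrius

For the second solution, I didn't manage to get it working. First, I guess it should be 'useless_dirs' on line 7, but then I got this error: 'ValueError: list.remove(x): x not in list'. I added a 'print name' and saw that it tries to remove useless_dirs[x] from subdirs even when it isn't there. – Syrius

@Syrius My bad... I used 'and' when I should have used '&'. – tdelaney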
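
To see why `and` produced that ValueError while `&` does not, here is a small standalone sketch. The directory names and the `prune` helper are made up for illustration.

```python
subdirs = ["src", "docs", "UselessFolder1"]
useless_dirs = set(("UselessFolder1", "UselessFolder2"))

# `and` is boolean logic: with a non-empty left operand it simply
# returns the right operand, i.e. the whole useless_dirs set.
via_and = set(subdirs) and useless_dirs
print(sorted(via_and))  # ['UselessFolder1', 'UselessFolder2']

# `&` is set intersection: only the names present in both.
via_amp = set(subdirs) & useless_dirs
print(sorted(via_amp))  # ['UselessFolder1']

def prune(dirs, names):
    """Remove each name from dirs in place, like the loop in the answer."""
    for name in names:
        dirs.remove(name)  # raises ValueError if name is not in dirs

try:
    prune(list(subdirs), via_and)  # tries to remove 'UselessFolder2' too
    outcome = "ok"
except ValueError:
    outcome = "ValueError"
print(outcome)  # ValueError

pruned = list(subdirs)
prune(pruned, via_amp)
print(pruned)  # ['src', 'docs']
```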

So, after testing and discussing with tdelaney, I optimized both solutions as follows:

import os

count = 0
target_files = set((".ext1", ".ext2", ".ext3")) # etc
useless_dirs = set(("UselessFolder1", "UselessFolder2")) # etc
# it could be target_dirs instead; just swap `in` and `not in` in the comparison.

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for content in pathList:
        fullPath = os.path.join(path, content)
        if os.path.isfile(fullPath):
            if os.path.splitext(fullPath)[1] in target_files:
                count += 1
                # do stuff with 'fullPath' . . .
        elif os.path.isdir(fullPath):
            # compare the folder name, not the full path, against useless_dirs
            if content not in useless_dirs:
                SearchFiles1(fullPath)

def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        # drop ignored folders so os.walk never descends into them
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        for filename in [name for name in files if os.path.splitext(name)[1] in target_files]:
            count += 1
            fullPath = os.path.join(dirpath, filename)
            # do stuff with 'fullPath' . . .
    return count

It works correctly on Mac/PC with Python 2.7.5.

In terms of speed, the two versions are about the same.


2015-11-20 16:23:46 Syrius
