Traversing directories to count files containing a specific string

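I have a directory with several levels of subdirectories. All the files in it are HTML files (about 500 in total), and I want to go through each file to see whether it contains the "sub_middle_1col" division. I found a great tutorial at palewire.com and used it as my base. The two difficulties I'm running into are: 1) the code broke when it hit a subdirectory (it treated it as a file), and 2) it doesn't traverse subdirectories; that is, it only looks at files that are not in any subdirectory. I may have solved the first problem by adding a line (marked below), but I can't figure out how to integrate other solutions (e.g. os.walk) into the code to solve the second one. Any ideas? Thanks in advance for any suggestions.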

import os

path = "./Industries"
my_library = os.listdir(path)
out = open("out.txt", "w")

for page in my_library:
    file = os.path.join(path, page)
    if os.path.isfile(file) and file.endswith('.html'):  # I ADDED THIS LINE
        text = open(file, "r")
        hit_count = 0
        for line in text:
            if 'sub_middle_1col' in line:
                hit_count = hit_count + 1
                print >> out, page + " => " + str(hit_count)
        print page + " => " + str(hit_count)
        text.close()


2011-02-11 Gregory Saxton

Have you considered `grep -c -r sub_middle_1col ./Industries`? – MattH 2011-02-11 13:32:07

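Thanks, Matt. I had never heard of that. I tried it and it works! I still need the Python code, though, because later I'll be doing different things with the files in the directory. – 2011-02-11 16:20:48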

Well, you could try:

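import os

path = "./Industries"  # same starting directory as in the question

# os.walk descends into every subdirectory; `files` holds only regular files
for root, dirs, files in os.walk(path):
    for fname in files:
        if fname.endswith('.html'):
            fq = os.path.join(root, fname)
            for line in open(fq):
                if 'sub_middle_1col' in line:
                    ...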
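For reference, here is a minimal self-contained sketch of the same os.walk approach. It assumes Python 3 and assumes the goal is a per-file count of matching lines, as in the question's code. The function name `count_matching_lines`, its parameters, and the UTF-8 encoding choice are illustrative assumptions, not part of the original code.

```python
import os


def count_matching_lines(root_dir, needle, suffix=".html"):
    """Map each matching file path under root_dir to its number of lines containing needle.

    Hypothetical helper for illustration only.
    """
    counts = {}
    # os.walk visits root_dir and every directory below it, so a
    # subdirectory is never mistaken for a file that should be opened.
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for fname in filenames:
            if not fname.endswith(suffix):
                continue
            full_path = os.path.join(dirpath, fname)
            # Assumption: files are roughly UTF-8; undecodable bytes are
            # replaced rather than raising an error.
            with open(full_path, encoding="utf-8", errors="replace") as handle:
                counts[full_path] = sum(1 for line in handle if needle in line)
    return counts
```

Calling `count_matching_lines("./Industries", "sub_middle_1col")` and writing each `path => count` pair to out.txt should give output along the lines of the question's script, with files in nested subdirectories included.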

Using find() or regular expressions (the re module) to check for the 'sub_middle_1col' string might give you better performance...
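To make the alternatives concrete, the sketch below checks one line with the `in` operator, `str.find()`, and `re.search()`, and shows `str.count()` for tallying occurrences. The sample line is made up for illustration. The sketch makes no claim about which variant is faster; that depends on the data and the Python version, and is best measured with something like `timeit`.

```python
import re

# Hypothetical sample line, invented for this example.
line = '<div id="sub_middle_1col" class="sub_middle_1col">'
needle = "sub_middle_1col"

# Three equivalent ways to test for a fixed substring.
found_in = needle in line
found_find = line.find(needle) != -1  # find() returns -1 when absent
found_re = re.search(re.escape(needle), line) is not None

# str.count tallies non-overlapping occurrences within the line,
# which differs from counting lines that contain the string.
occurrences = line.count(needle)
```

For a fixed string all three checks agree; a regular expression mainly pays off when the pattern varies, for example when matching several div names at once.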


2011-02-11 13:31:58
