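Find all text files not containing some text string

I'm on Python 2.7.1 and I'm trying to identify all text files that don't contain some text string.

The program seemed to work at first, but whenever I add the text string to a file, the file keeps being reported as if it doesn't contain it (a false positive). When I check the contents of the text file, the string is clearly there.

The code I tried to write is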

import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    fList = []
    for fol, fols, fils in os.walk(rdir):
        fList.extend([os.path.join(rdir, fol, fil) for fil in fils if fil.endswith(extens) and fil.startswith(start)])
    if fList:
        for fil in fList:
            rFil = open(fil)
            for line in rFil:
                if not cSens:
                    line, sstring = line.lower(), sstring.lower()
                if sstring in line:
                    fList.remove(fil)
                    break
            rFil.close()
    if fList:
        plur = 'files do' if len(fList) > 1 else 'file does'
        print '\nThe following %d %s not contain "%s":\n' % (len(fList), plur, sstring)
        for fil in fList:
            print fil
    else:
        print 'No files were found that don\'t contain %(sstring)s.' % locals()

scanFiles2(rdir=r'C:\temp', sstring='!!syn', extens='.html', start='#', cSens=False)

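I suppose there's a flaw in the code, but I really don't see it.

UPDATE

The code still comes up with many false positives: files that do contain the search string but are identified as not containing it.

Could text encoding be an issue? I prefixed the search string with U to account for Unicode encoding, but it didn't make any difference.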
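Whether encoding can matter at all is easy to check in isolation. The sketch below is an illustration only, not a diagnosis of the files in the question: it assumes a made-up sample file saved as UTF-16 and compares a raw byte search with a search on decoded text. The helper names `contains_bytes` and `contains_text` are hypothetical.

```python
import io
import os
import shutil
import tempfile

def contains_bytes(path, needle):
    """Search the raw bytes of the file, with no decoding."""
    with open(path, 'rb') as fh:
        return needle in fh.read()

def contains_text(path, needle, encoding):
    """Decode the file with the given encoding, then search the text."""
    with io.open(path, encoding=encoding) as fh:
        return needle in fh.read()

# Hypothetical sample file, written as UTF-16 for the demonstration.
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, '#sample.html')
with io.open(sample, 'w', encoding='utf-16') as fh:
    fh.write(u'<p>!!syn</p>')

raw_hit = contains_bytes(sample, b'!!syn')                # bytes are interleaved with NULs
decoded_hit = contains_text(sample, u'!!syn', 'utf-16')   # text search after decoding
shutil.rmtree(tmpdir)
print(raw_hit, decoded_hit)
```

In UTF-16 every ASCII character is stored as two bytes, so the byte sequence for `!!syn` never appears contiguously and the raw search misses it, while the decoded search finds it. If the real files are plain ASCII or UTF-8, this effect does not apply.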

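Does Python somehow cache file contents? I don't think so, but that could explain why files keep popping up after they've been corrected.

Could some kind of malware cause symptoms like these? That seems highly unlikely to me, but I'm getting a little desperate to get this fixed.

Source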

2013-12-13 RubenGeert

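I tried it and it works fine for me (I just changed 'extens' and 'rdir' to match my current env) –

@le_vine: that's nice, but for me it still includes some files that **do** contain the search string. I should add that the search string was added to them only recently. Any idea what might be going on? It's as if Python is getting the file contents from a cache instead of from disk or something... – RubenGeert

The naming conventions used in the code are not the best. There are too many 'fil', 'fLi' in the code. Try reading the code out loud. Try to use the names from the docs for the corresponding functions, e.g. 'dirpath, dirnames, filenames' instead of 'fol, fols, fils' – jfs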

Modifying a list while iterating over it leads to unexpected results:

For example:

>>> lst = [1,2,4,6,3,8,0,5] 
>>> for n in lst: 
...  if n % 2 == 0: 
...   lst.remove(n) 
... 
>>> lst 
[1, 4, 3, 0, 5]

Workaround: iterate over a copy

>>> lst = [1,2,4,6,3,8,0,5] 
>>> for n in lst[:]: 
...  if n % 2 == 0: 
...   lst.remove(n) 
... 
>>> lst 
[1, 3, 5]
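A third option, not part of the answer itself, is to build a new list with a list comprehension instead of removing from the old one. A minimal sketch using the same sample data:

```python
lst = [1, 2, 4, 6, 3, 8, 0, 5]

# Build a new list of the odd numbers; the original list is never
# modified while it is being iterated.
odds = [n for n in lst if n % 2 != 0]

# Slice assignment replaces the contents in place, which matters only
# if other names refer to the same list object.
lst[:] = odds
print(lst)
```

This avoids both the skipped elements and the repeated `remove` calls.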

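Alternatively, you can append the valid file paths to a new list instead of removing entries from the complete file list.

Modified version (appends files that do not contain sstring instead of removing):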

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    if not cSens:
        # Lowercasing the search string only needs to happen once.
        sstring = sstring.lower()
    fList = []
    for fol, fols, fils in os.walk(rdir):
        for fil in fils:
            if not (fil.startswith(start) and fil.endswith(extens)):
                continue
            fil = os.path.join(fol, fil)
            with open(fil) as rFil:
                for line in rFil:
                    if not cSens:
                        line = line.lower()
                    if sstring in line:
                        break
                else:
                    # No break happened, so no line contained sstring.
                    fList.append(fil)
    ...

list.remove takes O(n) time, while list.append takes O(1). See Time Complexity.
Use the with statement where possible.
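For reference, here is one way the elided version above might be completed and exercised. It is a sketch under assumptions: the function name `files_without`, the return value, and the demo files in a temporary directory are all made up for illustration.

```python
import os
import shutil
import tempfile

def files_without(rdir, sstring, extens, start='', cSens=False):
    """Return paths under rdir whose contents never contain sstring."""
    if not cSens:
        sstring = sstring.lower()
    missing = []
    for dirpath, dirnames, filenames in os.walk(rdir):
        for name in filenames:
            if not (name.startswith(start) and name.endswith(extens)):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as fh:
                for line in fh:
                    if not cSens:
                        line = line.lower()
                    if sstring in line:
                        break  # found: the else clause below is skipped
                else:
                    # Runs only when the loop finished without a break.
                    missing.append(path)
    return missing

# Made-up demo tree: two files contain the marker, one does not.
demo = tempfile.mkdtemp()
for name, text in [('#a.html', 'x\n!!SYN\n'),
                   ('#b.html', 'nothing here\n'),
                   ('#c.html', '!!syn\n')]:
    with open(os.path.join(demo, name), 'w') as fh:
        fh.write(text)

result = [os.path.basename(p) for p in files_without(demo, '!!syn', '.html', '#')]
shutil.rmtree(demo)
print(result)
```

Because nothing is removed from a list that is being iterated, every matching file is opened and checked exactly once.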

Source

2013-12-15 07:27:21 falsetru

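To find the files, consider [glob](http://docs.python.org/2/library/glob.html)? – Ray

@Ray, 'glob.glob' does not recurse. You could use 'glob.glob', but 'os.walk' already yields the file list, so it doesn't seem necessary. Did you mean ['fnmatch.fnmatch'](http://docs.python.org/2/library/fnmatch.html#fnmatch.fnmatch)? – falsetru

Oh, now I see. 'os.walk' works just fine. :) – Ray

Falsetru has already shown you why you should not remove items from a list while looping over it: a list iterator does not update its counter when the list is shortened, so if the item at index 3 was processed and you then removed it, the item visited next, at index 4, is the one that was previously located at index 5.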
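The skipping can be made visible by recording which positions the loop actually visits. This is a small illustrative sketch reusing the sample list from the earlier answer; the `visited` list is added only for the demonstration.

```python
lst = [1, 2, 4, 6, 3, 8, 0, 5]
visited = []

for index, n in enumerate(lst):
    visited.append((index, n))   # record what the iterator handed out
    if n % 2 == 0:
        lst.remove(n)            # shifts every later item one slot left

print(visited)
print(lst)
```

The values 4 and 0 are never handed to the loop body, because each removal moves them into a slot the iterator has already passed. That is why they survive in the final list.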

A list comprehension version using fnmatch.filter() and any(), with a filter lambda for case-insensitive matching:

import fnmatch
import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # Per-line test: does this line contain the search string?
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    return [os.path.join(root, fname)
            for root, _, files in os.walk(rdir)
            for fname in fnmatch.filter(files, ffilter)
            if not any(lfilter(l) for l in open(os.path.join(root, fname)))]

But perhaps you'd be better off sticking with a more readable loop:

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # Per-line test: does this line contain the search string?
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    result = []
    for root, _, files in os.walk(rdir):
        for fname in fnmatch.filter(files, ffilter):
            fname = os.path.join(root, fname)
            with open(fname) as infh:
                if not any(lfilter(l) for l in infh):
                    result.append(fname)
    return result
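The file-name pattern built in `ffilter` can be tried on its own. The sketch below uses made-up file names and the `start` and `extens` values from the question to show what `fnmatch.filter()` keeps.

```python
import fnmatch

start, extens = '#', '.html'
ffilter = '{}*{}'.format(start, extens)   # becomes '#*.html'

# Hypothetical directory listing for the demonstration.
names = ['#index.html', 'index.html', '#notes.txt', '#a.b.html']
matched = fnmatch.filter(names, ffilter)
print(matched)
```

One difference from `startswith`/`endswith` is worth keeping in mind: `fnmatch` treats `*`, `?` and `[` as wildcards, so a `start` or `extens` value containing those characters would match more loosely than a literal prefix or suffix test.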

Source

2013-12-21 13:34:03

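Another alternative, which opens the search up to regular expressions (although just using grep with the appropriate options would still be better):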

import mmap
import os
import re
import fnmatch

def scan_files(rootdir, search_string, extension, start='', case_sensitive=False):
    # re.escape makes the search string match literally.
    rx = re.compile(re.escape(search_string), flags=re.I if not case_sensitive else 0)
    name_filter = start + '*' + extension
    for root, dirs, files in os.walk(rootdir):
        for fname in fnmatch.filter(files, name_filter):
            with open(os.path.join(root, fname)) as fin:
                try:
                    # Map the whole file (length 0) read-only.
                    mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
                except ValueError:
                    continue  # empty files etc.... include this or not?
                if not next(rx.finditer(mm), None):
                    yield fin.name

Then call list on it if you want the names materialized, or treat it as you would any other generator...
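The answer above targets Python 2, where a `str` pattern can search an `mmap` object directly. On Python 3 the pattern has to be `bytes` and the file should be opened in binary mode. The following is a sketch of that variation, not the answerer's code: the name `scan_files_bytes` is hypothetical, the file-name filter is left out for brevity, and, unlike the original, it reports empty files as not containing the string.

```python
import mmap
import os
import re
import shutil
import tempfile

def scan_files_bytes(rootdir, search_bytes, case_sensitive=False):
    """Yield paths under rootdir whose raw bytes do not contain search_bytes."""
    rx = re.compile(re.escape(search_bytes), 0 if case_sensitive else re.I)
    for root, dirs, files in os.walk(rootdir):
        for fname in files:
            path = os.path.join(root, fname)
            if os.path.getsize(path) == 0:
                yield path  # assumption: an empty file counts as "not containing"
                continue
            with open(path, 'rb') as fin:
                mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
                try:
                    found = rx.search(mm) is not None
                finally:
                    mm.close()
            if not found:
                yield path

# Made-up demo files in a temporary directory.
demo = tempfile.mkdtemp()
for name, data in [('a.html', b'<p>!!SYN</p>'), ('b.html', b'<p>none</p>'), ('c.html', b'')]:
    with open(os.path.join(demo, name), 'wb') as fh:
        fh.write(data)

# list() or sorted() materializes the generator, as described above.
names = sorted(os.path.basename(p) for p in scan_files_bytes(demo, b'!!syn'))
shutil.rmtree(demo)
print(names)
```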

Source

2013-12-21 14:27:08

Please don't write a Python program. This program already exists. Use grep:

grep * -Ilre 'main' 2> /dev/null 
99client/.git/COMMIT_EDITMSG 
99client/taxis-android/build/incremental/mergeResources/production/merger.xml 
99client/taxis-android/build/incremental/mergeResources/production/inputs.data 
99client/taxis-android/build/incremental/mergeResources/production/outputs.data 
99client/taxis-android/build/incremental/mergeResources/release/merger.xml 
99client/taxis-android/build/incremental/mergeResources/release/inputs.data 
99client/taxis-android/build/incremental/mergeResources/release/outputs.data 
99client/taxis-android/build/incremental/mergeResources/debug/merger.xml 
99client/taxis-android/build/incremental/mergeResources/debug/inputs.data 
(...)

http://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#Introduction

If you need the list in Python, just execute grep from it and collect the results.
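Running grep from Python could look roughly like the sketch below. It is an illustration with assumptions: the function name `grep_files_without` and the demo files are made up, and the command uses `-L` (list files without a match) rather than the `-l` shown above, since the question asks for files that do not contain the string. The other flags are `-r` (recurse), `-F` (fixed string), `-i` (ignore case) and `-e` (pattern follows).

```python
import os
import shutil
import subprocess
import tempfile

def grep_files_without(pattern, rootdir):
    """Return files under rootdir that grep lists as not containing pattern."""
    cmd = ['grep', '-rFiL', '-e', pattern, rootdir]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    # The exit status is not checked here; its meaning with -L has
    # differed between grep versions.
    return out.decode().splitlines()

# Made-up demo files in a temporary directory.
demo = tempfile.mkdtemp()
for name, text in [('a.html', '!!SYN\n'), ('b.html', 'none\n')]:
    with open(os.path.join(demo, name), 'w') as fh:
        fh.write(text)

try:
    found = sorted(os.path.basename(p) for p in grep_files_without('!!syn', demo))
except OSError:
    found = None  # grep is not installed on this machine
shutil.rmtree(demo)
print(found)
```

This only works where a grep executable is on the PATH, which is not a given on Windows, the platform used in the question.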

Source

2013-12-21 14:51:59 hdante
