findall() regex: iterating through files to find words from a list

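I have code that walks through files recursively, looking for words from a list. When a word is found, it prints the file it was found in, the string that was searched for, and the line number it was found on.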

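My problem is that searching for api also matches myapistring, 'pass' matches 'compass', and 'dev' matches 'device', rather than only the actual word. So I need to apply a regex somewhere, but I'm not sure which regex, or where in the for loops it belongs.

The regex that I (think) works is: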

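import os
import re

regex = r'([\w.]+)'

rootpath = myDir  # myDir is set elsewhere in the script
wordlist = ["api", "pass", "dev"]
exclude = ["testfolder", "testfolder2"]
complist = []
files = []

for word in wordlist:
    # Flags belong in re.compile(); the second argument of a compiled
    # pattern's findall() is a start position, not a flags value.
    complist.append(re.compile(word, re.IGNORECASE))

for path, name, fname in os.walk(rootpath):
    # Prune excluded directories in place so os.walk() skips them
    name[:] = [d for d in name if d not in exclude]
    for fileNum in fname:
        files.append(os.path.join(path, fileNum))

for fileLine in files:
    # exten (the file extensions to scan) is defined elsewhere in the script
    if any(ext in fileLine for ext in exten):
        with open(fileLine, "r") as f:
            for count, line in enumerate(f, start=1):
                for lv in complist:
                    for mat in lv.findall(line):
                        print(fileLine, mat, count)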

Thanks

Edit: I added the code that was suggested:

for word in wordlist:
    # Raw strings keep Python from turning '\b' into a backspace character
    complist.append(re.compile(r'\b' + re.escape(word) + r'\b'))

It works with a few errors, but well enough that I can work with it.

Source

2016-02-02 Bob

http://stackoverflow.com/questions/15863066/python-regular-expression-match-whole-word –

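Thanks, but that doesn't help me with where to place the regex so that it only finds the whole word in a line rather than an instance of it inside another word. – Bob

I don't know Python, but my guess is right after this line: "for line in open(fileLine, "r").readlines():", applied to the line as "re.search(r'\bis\b', line)" –

Instead of: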

for word in wordlist: 
    complist.extend([re.compile(word)])

use word boundaries:

for word in wordlist: 
    complist.extend([re.compile(r'\b{}\b'.format(word))])

\b is a zero-length match for the start or end of a word, so \bthe\b will match this line:

the lazy dog

but not this line:

then I checked StackOverflow
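
As a quick check against the word list from the question, here is a minimal sketch. The helper name find_words and the sample lines are made up for illustration, and re.IGNORECASE is passed to re.compile() on the assumption that case-insensitive matching is wanted:

```python
import re

wordlist = ["api", "pass", "dev"]

# One compiled pattern per word, anchored on word boundaries
patterns = [re.compile(r'\b{}\b'.format(w), re.IGNORECASE) for w in wordlist]

def find_words(line):
    """Return every whole-word hit from wordlist found in line (hypothetical helper)."""
    hits = []
    for pattern in patterns:
        hits.extend(pattern.findall(line))
    return hits

print(find_words("the API needs a pass for dev"))  # ['API', 'pass', 'dev']
print(find_words("myapistring compass device"))    # []
```

The second line produces no hits because api, pass and dev only appear there inside longer words.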

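Another thing I'd like to point out: if word contains any characters that have a special meaning to the regex engine, they will be interpreted as part of the regex. So instead of: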

complist.extend([re.compile(r'\b{}\b'.format(word))])

use:

complist.extend([re.compile(r'\b{}\b'.format(re.escape(word)))])
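
To see why the escaping matters, here is a small sketch. The search term "v1.0" and the sample line are invented for illustration; they are not from the question's word list:

```python
import re

word = "v1.0"  # hypothetical search term containing a regex metacharacter

# Without escaping, the '.' means "any character"
unescaped = re.compile(r'\b{}\b'.format(word))
# With re.escape(), the '.' only matches a literal dot
escaped = re.compile(r'\b{}\b'.format(re.escape(word)))

line = "upgraded from v1x0 today"
print(bool(unescaped.search(line)))  # True: '.' matched the 'x'
print(bool(escaped.search(line)))    # False: a literal '.' is required
print(bool(escaped.search("upgraded to v1.0 today")))  # True
```

One caveat: \b only works at the edge of a word character, so a term that starts or ends with punctuation may need different handling.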

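Edit: As the comments point out, you also want to match words separated by _. Python considers _ a "word character", so to treat it as a word separator, you can do this: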

re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)))

See it working here:

In [45]: regex = re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word))) 

In [46]: regex.search('this line contains is_admin') 
Out[46]: <_sre.SRE_Match at 0x105bca3d8> 

In [47]: regex.search('this line contains admin') 
Out[47]: <_sre.SRE_Match at 0x105bca4a8> 

In [48]: regex.search("does not have the word") 

In [49]: regex.search("does not have the wordadminword")
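
One thing to watch with findall(): the pattern above consumes the underscore, so the match for 'is_admin' comes back as '_admin'. If only the word itself is wanted, one possible alternative (a different technique, not the pattern above) is to use lookarounds, which check the neighbouring characters without including them in the match. This is a sketch that assumes ASCII text; whole_word is a hypothetical helper name:

```python
import re

def whole_word(word):
    # Hypothetical helper: the word must not be preceded or followed by an
    # ASCII letter or digit; '_' and punctuation count as separators.
    return re.compile(
        r'(?<![A-Za-z0-9]){}(?![A-Za-z0-9])'.format(re.escape(word)),
        re.IGNORECASE,
    )

regex = whole_word("admin")
print(regex.findall("is_admin and ADMIN_user"))  # ['admin', 'ADMIN']
print(regex.findall("wordadminword"))            # []
```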

Source

2016-02-02 10:59:20 Will

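This gives some odd results. It apparently matches: String '('u00b6', 'w')' found on line 30, which is not in my word list. It doesn't find the words in the list, even though I know they are there, because re.compile(word) finds them. – Bob

Sorry, try my edit! We need r'raw strings' to keep Python from interpreting '\b'. – Will

Thanks, that works, but it misses something I was expecting. Should complist.extend([re.compile(r'\b{}\b'.format(re.escape(word)))]) find is_admin if 'admin' is in the word list? Currently it doesn't, I'm guessing because of the underscore? – Bob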
