Python: How to find all matches in a multi-line string, but only those not preceded by specific words?

I have SQL code and want to extract the table names that follow the 'insert' keyword.

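Basically, I want to extract them using the following rules:

Contains the word 'insert'
Followed by the word 'into', which is optional
Exclude the match if there is a '--' (a single-line comment in SQL) anywhere before the insert [into] keyword on the same line.
Exclude the match if the insert [into] keyword sits between '/*' and '*/' (a multi-line comment in SQL).
Get the next word (the table name) after the insert [into] keyword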

Example:

import re 

lines = """begin insert into table_1 end 
    begin insert table_2 end 
    select 1 --This is will not insert into table_3 
    begin insert into 
     table_4 
    end 
    /* this is a comment 
    insert into table_5 
    */ 
    insert into table_6 
    """ 

p = re.compile(r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE) 
for m in re.finditer(p, lines): 
    line = lines[m.start(): m.end()].strip() 

    starts_with_insert = re.findall('insert.*', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL) 
    print(re.compile(r'insert\s+(?:into\s+)?', flags=re.IGNORECASE|re.MULTILINE|re.DOTALL).split(' '.join(starts_with_insert))[1].split()[0])

Actual result:

table_1 
table_2 
table_4 
table_5 
table_6

Expected result: table_5 should not be returned, because it sits between /* and */

table_1 
table_2 
table_4 
table_6

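Is there an elegant way to do this?

Thanks in advance.

Edit: Thanks for the solutions. Is it possible to do this with pure regex, without stripping lines from the original text?

I want to show the line number in the original string where each table name is found.

Updated code below: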

import re 

lines = """begin insert into table_1 end 
    begin insert table_2 end 
    select 1 --This is will not insert into table_3 
    begin insert into 
     table_4 
    end 
    /* this is a comment 
    insert into table_5 
    */ 
    insert into table_6 
    """ 

p = re.compile(r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE) 
for m in re.finditer(p, lines): 
    line = lines[m.start(): m.end()].strip() 
    line_no = str(lines.count("\n", 0, m.end()) + 1).zfill(6) 

    table_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL) 
    print('[line number: ' + line_no + '] ' + '; '.join(table_names))

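I tried using lookahead/lookbehind to exclude the matches between /* and */, but it did not produce the result I expected.

I'd appreciate your help. Thanks!

Source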

2017-10-05 pren

You're forgetting that '--' and '/*' do not start a comment when they appear inside a string literal...
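The point about string literals can be illustrated with a small sketch. The sample SQL and the helper name `strip_comments` are assumptions made up for illustration, not part of the thread. The idea is to let the pattern consume single-quoted literals first, so that a `--` or `/*` inside a literal is never treated as a comment start.

```python
import re

# Hypothetical sample: the first '--' sits inside a string literal, not a comment.
sql = "insert into log_table values ('a -- b'); -- real comment\ninsert into table_7"

# Alternation order matters: a quoted literal (with '' as an escaped quote)
# is consumed before the comment alternatives can match inside it.
TOKEN = re.compile(r"('(?:[^']|'')*')|--[^\n]*|/\*.*?\*/", re.DOTALL)

def strip_comments(text):
    # Keep string literals (group 1) untouched; replace comments with nothing.
    return TOKEN.sub(lambda m: m.group(1) or "", text)

cleaned = strip_comments(sql)
naive = re.sub(r"--.*", "", sql)  # for contrast: cuts the literal in half

print(cleaned)
print(naive)
```

The naive substitution truncates the first statement at the `--` inside the literal, while the literal-aware version only drops the trailing comment. Double-quoted identifiers and dialect-specific quoting are not handled in this sketch.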

I think you should read up on 'lookbehind assertions'.
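One practical limit on the lookbehind idea is worth noting: Python's built-in `re` module only accepts fixed-width lookbehind, so a condition like "no `--` anywhere earlier on the line" cannot be written as a single lookbehind there. A minimal sketch, with patterns invented for illustration:

```python
import re

# A fixed-width lookbehind compiles: "insert" not immediately preceded by "--".
fixed = re.compile(r'(?<!--)\binsert\b', re.IGNORECASE)
hits = fixed.findall("--insert x; insert y")
print(hits)

# A variable-width lookbehind is rejected by the built-in re module.
try:
    re.compile(r'(?<!--.*)\binsert\b')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re.error:", exc)
```

The fixed-width form only looks at the two characters directly before `insert`, which is why it cannot express the comment rule from the question on its own.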

In two steps, using the re.sub() and re.findall() functions:

# removing single line/multiline comments
# (flags must be passed by keyword: the 4th positional argument of re.sub() is count)
stripped_lines = re.sub(r'/\*[\s\S]*?\*/\s*|.*--.*(?=\binsert).*\n?', '', lines, flags=re.I)

# extracting table names preceded by `insert` statement 
tbl_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', stripped_lines, re.I) 
print(tbl_names)

Output:

['table_1', 'table_2', 'table_4', 'table_6']
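For the follow-up about line numbers, one possible variation is to overwrite comments with spaces instead of deleting them, keeping every newline so that match offsets still map to lines of the original text. This is a sketch, not part of the answer above; the helper names `blank_comments` and `find_inserts` are hypothetical, and string literals are not handled.

```python
import re

lines = """begin insert into table_1 end
    begin insert table_2 end
    select 1 --This is will not insert into table_3
    begin insert into
     table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

COMMENT = re.compile(r'/\*.*?\*/|--[^\n]*', re.DOTALL)
INSERT = re.compile(r'\binsert\s+(?:into\s+)?(\S+)', re.IGNORECASE)

def blank_comments(text):
    # Replace every non-newline character of a comment with a space,
    # so the text keeps its length and its line breaks.
    return COMMENT.sub(lambda m: re.sub(r'[^\n]', ' ', m.group(0)), text)

def find_inserts(text):
    masked = blank_comments(text)
    # Line number = newlines before the table name, plus one.
    return [(masked.count('\n', 0, m.start(1)) + 1, m.group(1))
            for m in INSERT.finditer(masked)]

result = find_inserts(lines)
print(result)
```

Because the masked text has the same length as the original, `m.start(1)` can also be used directly as an offset into the original string.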

Source

2017-10-05 11:11:24 RomanPerekhrest

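Hi Roman, your solution using findall is simpler than my original. I've updated the code above with your solution. Is it possible to achieve the same result without stripping the original text? – pren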

import re

lines = """begin insert into table_1 end
    begin insert table_2 end
    select 1 --This is will not insert into table_3
    begin insert into
     table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

# remove all /* */ and -- comments
comments = re.compile(r'/\*(?:.*\n)+.*\*/|--.*?\n', flags=re.IGNORECASE | re.MULTILINE)
for comment in comments.findall(lines):
    # str.replace() works in Python 2 and 3; string.replace() exists only in Python 2
    lines = lines.replace(comment, '')

fullSet = re.compile(r'insert\s+(?:into\s+)*(\S+)', flags=re.IGNORECASE | re.MULTILINE)
print(fullSet.findall(lines))

This gives:

['table_1', 'table_2', 'table_4', 'table_6']

Source

2017-10-05 12:35:51

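Thanks Calvin for the nice solution. It really works, but I'm wondering whether it is possible to use regex directly without removing lines? I've updated the question above. Thanks – pren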

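Regex has no mechanism for interpreting context. Removing the comments entirely guarantees you will never match anything inside them. As you may be realizing, regexes quickly degenerate into an unreadable mess. I'll condense my answer further. I don't think you can get there without the removal step; there are too many "if" cases. I'll play with it some more.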
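A commonly used workaround that avoids a separate removal step is a single alternation that consumes comments first and captures a table name only in the last alternative. This does not make the regex context-aware; it relies on the scanner moving left to right and swallowing each comment whole. The sketch below uses a shortened sample and the hypothetical names `PATTERN` and `found`, and it does not handle string literals.

```python
import re

sql = """begin insert into table_1 end
select 1 -- will not insert into table_3
/* comment
insert into table_5
*/
insert table_6
"""

# Comments are matched (and thereby skipped) by the first two alternatives;
# only the last alternative captures a table name in group 1.
PATTERN = re.compile(
    r'/\*.*?\*/'                          # multi-line comment
    r'|--[^\n]*'                          # single-line comment
    r'|\binsert\s+(?:into\s+)?(\S+)',     # insert [into] <table>
    re.IGNORECASE | re.DOTALL,
)

found = []
for m in PATTERN.finditer(sql):
    if m.group(1) is None:
        continue  # this match was a comment, not an insert
    # Line number in the original, unmodified text.
    line_no = sql.count('\n', 0, m.start(1)) + 1
    found.append((line_no, m.group(1)))

print(found)
```

Since the original string is never modified, the offsets and line numbers refer directly to the input.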
