Regex greediness problem

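I need to parse a string with Python and extract two tokens separated by : (a colon). Each token may be wrapped in single quotes, double quotes, or no quotes at all.

Sample cases that work: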

# <input string> -> <tuple that should return> 

1) abc:def -> (abc, def) 
2) abc:"def" -> (abc, def) 
3) "abc":def -> (abc, def) 
4) "abc":"def" -> (abc, def) 
5) "a:bc":abc -> (a:bc, abc)

Sample cases that do not work:

# <input string> -> <tuple that should return> 

6) abc:"a:bc" -> (abc, a:bc) 
7) "abcdef" -> (abcdef,)

The regex I am using is:

>>> import re 
>>> rex = re.compile(r"(?P<fquote>[\'\"]?)" 
        r"(?P<user>.+)" 
        r"(?P=fquote)" 
        r"(?:\:" 
        r"(?P<squote>[\'\"]?)" 
        r"(?P<pass>.+)" 
        r"(?P=squote))")

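I have two problems. First, sample cases 6) and 7) do not work. Second, after rex.match I want groups() to return every matched group except fquote and squote. That is, rex.match("'abc':'def'").groups() currently returns ("'", "abc", "'", "def"), and I only want ("abc", "def").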
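To make the two failing cases concrete, the following sketch runs the pattern from the question, unchanged, against inputs 6) and 7). The variable names case6 and case7 are only for illustration. It suggests that case 6 fails because the greedy .+ in the user group swallows the first colon, and case 7 fails because the colon part of the pattern is mandatory.

```python
import re

# The pattern from the question, unchanged.
rex = re.compile(r"(?P<fquote>[\'\"]?)"
                 r"(?P<user>.+)"
                 r"(?P=fquote)"
                 r"(?:\:"
                 r"(?P<squote>[\'\"]?)"
                 r"(?P<pass>.+)"
                 r"(?P=squote))")

# Case 6: the greedy .+ in "user" runs to the end of the string and only
# backtracks as far as the LAST colon, so the split lands inside the quotes.
case6 = rex.match('abc:"a:bc"')
print(case6.groups())  # ('', 'abc:"a', '', 'bc"')

# Case 7: there is no colon at all, and the colon group is not optional,
# so the whole match fails.
case7 = rex.match('"abcdef"')
print(case7)  # None
```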

Any ideas?

Thanks

Source

2013-03-11 user1595496

Instead of a regex, you could use the csv module here:

import csv

inputs = [
    'abc:def', 'abc:"def"', '"abc":def', '"abc":"def"', '"a:bc":abc',  # working
    'abc:"a:bc"', 'abcdef',  # not working
]

# Each input is parsed as a one-line CSV file that uses ':' as the delimiter.
for idx, el in enumerate(inputs, start=1):
    print(idx, tuple(next(csv.reader([el], delimiter=':'))))

This gives you:

1 ('abc', 'def') 
2 ('abc', 'def') 
3 ('abc', 'def') 
4 ('abc', 'def') 
5 ('a:bc', 'abc') 
6 ('abc', 'a:bc') 
7 ('abcdef',)

Source

2013-03-11 16:18:45

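It does not work when single and double quotes are mixed. For example: s = '\'abc\':"a:bc"' – user1595496 2013-03-11 16:23:58

@user1595496 Ahh, I did not spot that your regex was meant to accept either kind of quote... (I was only going by your example data) – 2013-03-11 16:25:31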

I just want a regex rather than an external module, but thanks anyway. – user1595496 2013-03-11 16:30:01
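The limitation raised in the comments above can be reproduced with a short sketch. csv.reader accepts only a single quotechar, so whichever quote character is not configured is treated as ordinary data. The variable names below are for illustration only.

```python
import csv

# The mixed-quote string from the comment: 'abc':"a:bc"
mixed = '\'abc\':"a:bc"'

# Default quotechar is '"': the single quotes are kept as part of the field.
default_result = next(csv.reader([mixed], delimiter=':'))
print(default_result)  # ["'abc'", 'a:bc']

# quotechar="'": now the double quotes are plain data, so the colon inside
# them is treated as a delimiter and the second token is split in two.
single_result = next(csv.reader([mixed], delimiter=':', quotechar="'"))
print(single_result)  # ['abc', '"a', 'bc"']
```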

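import re

def foo(string):
    # First alternative: two tokens separated by a colon (user and pass).
    # Second alternative: a single token with no colon (suser).
    rex = re.compile(r"(?P<fquote>[\'\"]?)"
        r"(?P<user>.+?)"
        r"(?:(?P=fquote))"
        r"(?:\:"
        r"(?P<squote>[\'\"]?)"
        r"(?P<pass>.+)"
        r"(?P=squote))"
        r"|(?P<sfquote>[\'\"]?)"
        r"(?P<suser>.+)"
        r"(?:(?P=sfquote))")
    match = rex.match(string)
    # Pick the groups by name so the quote groups never reach the result.
    suser_match = match.group("suser")
    return (suser_match,) if suser_match else (match.group("user"), match.group("pass"))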

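This does the job, but I strongly advise against it. A regex should be kept as simple as possible, and a solution like this one is hard to understand and therefore hard to maintain. You probably need a context-free grammar, which in my opinion is a better fit for the kind of patterns you gave (for example the "abcdef" string, which needs a separate group).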

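As for your second problem: named groups are still captured even when you wrap them in (?:...). That is why I think it is easier to retrieve them by name and build the tuple from the named groups you actually want.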
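A minimal sketch of that point, using a simplified pattern that is an assumption for illustration and not the pattern from the question: wrapping a named group in (?:...) does not stop it from appearing in groups(), but Match.group accepts several names at once and returns only those.

```python
import re

# Simplified two-token pattern (illustrative only): the quote groups are
# wrapped in (?:...), yet they are still capturing groups.
pattern = re.compile(
    r"(?:(?P<fquote>['\"]?))(?P<user>[^'\":]+)(?P=fquote)"
    r":"
    r"(?:(?P<squote>['\"]?))(?P<pass>[^'\":]+)(?P=squote)"
)

m = pattern.match("'abc':'def'")

# groups() returns every capturing group, including the quotes.
print(m.groups())  # ("'", 'abc', "'", 'def')

# group() with several names returns a tuple of just those groups.
print(m.group("user", "pass"))  # ('abc', 'def')
```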

Source

2013-03-11 18:03:43

Why do you have to retrieve all of the groups? Just take the ones you are interested in and ignore the rest. Here is an example:

import re

rex = re.compile(
    r"""^(?:
        (?P<fquote>['"])
        (?P<user1>(?:(?!(?P=fquote)).)+)
        (?P=fquote)
      |
        (?P<user2>[^:"'\s]+)
    )
    (?:
        :
        (?:
            (?P<squote>['"])
            (?P<pass1>(?:(?!(?P=squote)).)+)
            (?P=squote)
          |
            (?P<pass2>[^:"'\s]+)
        )
    )?
    $""",
    re.VERBOSE)

# subject is the input string to parse
result = rex.sub(r"\g<user1>\g<user2> : \g<pass1>\g<pass2>", subject)

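Other notes:

Splitting the pattern so that quoted and unquoted fields are handled separately makes the job much easier. You know that one group in each pair will always be empty, so it is safe to concatenate them.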
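(?:(?!(?P=fquote)).)+ consumes one character at a time, but only after confirming that the character is not the same as the opening quote. You do not have to worry about running past the closing quote, as .+ would. (You really should have been using the reluctant .+? there, but this way is even better.)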
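The two notes above can be combined into a small helper. The sketch below reuses the pattern from this answer and returns a tuple instead of calling sub; the names TOKEN_RE and split_tokens are hypothetical, and returning None for input that does not match is an assumption about the desired behaviour.

```python
import re

TOKEN_RE = re.compile(
    r"""^(?:
        (?P<fquote>['"])
        (?P<user1>(?:(?!(?P=fquote)).)+)   # quoted: anything but the opening quote
        (?P=fquote)
      |
        (?P<user2>[^:"'\s]+)               # unquoted: no colon, quote or space
    )
    (?:
        :
        (?:
            (?P<squote>['"])
            (?P<pass1>(?:(?!(?P=squote)).)+)
            (?P=squote)
          |
            (?P<pass2>[^:"'\s]+)
        )
    )?
    $""",
    re.VERBOSE,
)


def split_tokens(text):
    """Return (user,) or (user, password), or None if text does not match."""
    m = TOKEN_RE.match(text)
    if m is None:
        return None
    # Exactly one group of each pair can match, so "or" picks the filled one.
    user = m.group("user1") or m.group("user2")
    password = m.group("pass1") or m.group("pass2")
    return (user,) if password is None else (user, password)


print(split_tokens('abc:"a:bc"'))      # ('abc', 'a:bc')
print(split_tokens('"abcdef"'))        # ('abcdef',)
print(split_tokens('\'abc\':"a:bc"'))  # ('abc', 'a:bc')
```

Picking groups by name keeps the quote groups out of the result without any post-processing of groups().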

Source

2013-03-11 20:53:51
