Make sure two regular expressions don't find the same result

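I want to parse all dates out of a string (they may be written in different forms). The problem is that a date may be written in the form d/m -y, for example 22/11 -12. But a date may also be written in the form d/m, with no year specified. If I find a date in the longer form in the string, I don't want it to be found again in the shorter form. This is where my code fails: it finds the first date twice (once with the year and once without it).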

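I really have two questions: (1) What is the "right" way to do this? It seems I'm approaching the problem from the wrong angle. (2) If I should stick with this approach, why doesn't the line datestring.replace(match.group(0), '') remove the date so that it can't be found again?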

Here is my code:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import re 

dformats = (
    '(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})', 
    '(?P<day>\d{1,2})/(?P<month>\d{1,2})', 
    '(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', 
      ) 


def get_dates(datestring): 
    """Try to extract all dates from certain strings. 

    Arguments: 
    - `datestring`: A string containing dates. 
    """ 
    global dformats 

    found_dates = [] 

    for regex in dformats:
        matches = re.finditer(regex, datestring)
        for match in matches:
            # Is supposed to make sure the same date is not found twice
            datestring.replace(match.group(0), '')

            found_dates.append(match)
    return found_dates

if __name__ == '__main__': 
    dates = get_dates('1/2 -13, 5/3 & 2012-11-22') 
    for date in dates:
        print(date.groups())

2012-11-22 Niclas Nilsson

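There are two ways:

1. Use a single regular expression and join all the cases with the | operator:

expr = re.compile(r"expr1|expr2|expr3")

2. Find only a single instance at a time, then pass a "start position" to the next search. Note that this complicates things, because you always want to start with the earliest match, no matter which format it is in. That is, loop over all three patterns, find the earliest match, do the replacement, then search again from the advanced start position. That makes option 1 the easier one.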
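
The second option could look roughly like the sketch below. It is only an illustration: the names PATTERNS and get_dates_by_position are made up for this example, and the patterns are the three from the question with raw-string prefixes added. It relies on the optional pos argument of a compiled pattern's search() method.

```python
import re

# The three formats from the question, compiled once. The longer d/m -y form
# is listed first so that it wins when two patterns match at the same position.
PATTERNS = [
    re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})'),
    re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})'),
    re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'),
]


def get_dates_by_position(datestring):
    """Take the earliest match among all patterns, then resume after it."""
    found = []
    pos = 0
    while True:
        # search(string, pos) starts scanning at index pos.
        candidates = [m for m in (p.search(datestring, pos) for p in PATTERNS) if m]
        if not candidates:
            break
        # min() returns the first minimal item, so ties go to the earlier pattern.
        best = min(candidates, key=lambda m: m.start())
        found.append(best.groupdict())
        pos = best.end()  # never search the consumed text again
    return found


dates = get_dates_by_position('1/2 -13, 5/3 & 2012-11-22')
```

Because every pattern has to consume at least three characters, pos always advances and the loop terminates. It is still noticeably more code than the single combined expression.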

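A few additional points:

- Make sure you use "raw strings": put an 'r' in front of each string. Otherwise the '\' characters risk being eaten and never reaching the RE engine.
- Consider using "sub" with a callback function as the "repl" argument to do the replacement, instead of finditer. In that case "repl" is passed a match object and should return the replacement string.
- The match groups in the "single" RE will have the value None if their alternative was not chosen, so it is easy to detect which alternative was used.
- You should not say "global" unless you intend to modify the variable.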
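
As a rough illustration of the sub-with-callback point and the None-valued groups, here is a minimal sketch. The group names long, short and iso and the function collect_and_strip are invented for this example only; they are not from the question.

```python
import re

# One combined pattern; each alternative is wrapped in its own named group.
EXPR = re.compile(
    r'(?P<long>\d{1,2}/\d{1,2} -\d{2})'   # d/m -y
    r'|(?P<short>\d{1,2}/\d{1,2})'        # d/m
    r'|(?P<iso>\d{4}-\d{2}-\d{2})'        # yyyy-mm-dd
)


def collect_and_strip(text):
    """Record every date together with the alternative that matched it."""
    seen = []

    def repl(match):
        # Groups belonging to alternatives that did not match are None.
        kind = next(name for name, value in match.groupdict().items()
                    if value is not None)
        seen.append((kind, match.group(0)))
        return ''  # the callback must hand back the replacement string

    stripped = EXPR.sub(repl, text)
    return seen, stripped


seen, stripped = collect_and_strip('1/2 -13, 5/3 & 2012-11-22')
```

Here seen ends up holding one (alternative, text) pair per date, and stripped is whatever is left of the input once the dates are removed.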

Here is some complete working code.

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import re 

expr = re.compile(
    r'(?P<day1>\d{1,2})/(?P<month1>\d{1,2}) -(?P<year1>\d{2})'
    r'|(?P<day2>\d{1,2})/(?P<month2>\d{1,2})'
    r'|(?P<year3>\d{4})-(?P<month3>\d{2})-(?P<day3>\d{2})')


def get_dates(datestring):
    """Try to extract all dates from certain strings.

    Arguments:
    - `datestring`: A string containing dates.
    """

    found_dates = []
    for match in expr.finditer(datestring):
        # Groups of the alternatives that did not match are None.
        if match.group('day1'):
            found_dates.append({'day': match.group('day1'),
                                'month': match.group('month1'),
                                'year': match.group('year1')})
        elif match.group('day2'):
            found_dates.append({'day': match.group('day2'),
                                'month': match.group('month2')})
        elif match.group('day3'):
            found_dates.append({'day': match.group('day3'),
                                'month': match.group('month3'),
                                'year': match.group('year3')})
        else:
            raise Exception("wtf?")
    return found_dates

if __name__ == '__main__':
    dates = get_dates('1/2 -13, 5/3 & 2012-11-22')
    for date in dates:
        print(date)

2012-11-22 21:08:09

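Very good answer, congratulations! – georg

Emacs complains that you don't follow PEP 8 ;-) Otherwise a good answer. It does bother me, though, that there is no simple way to handle groups with the same name when using '|' in a regex. It takes 6 lines to deal with this (!) :-/

Niclas: Yes, I edited it about 3 times and it probably still isn't quite right :) And yes, I'm also frustrated that the RE engine didn't let me reuse the names. I could create a simple function that remaps them to clean up the code. If we can assume that we always copy PERIOD-X to PERIOD whenever dayX is present, with X in range(1, 4) and PERIOD in ['day', 'month', 'year'], then it's only a few lines. Any more complicated processing would probably have to be broken out as above.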
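
The remapping function described in that comment might look something like the sketch below. The name remap is hypothetical, and the sketch assumes that the year group of the first alternative is named year1 so that all three alternatives follow the same PERIOD-X naming scheme.

```python
import re

# Assumption: every group is named <period><x>, e.g. day1, month2, year3.
EXPR = re.compile(
    r'(?P<day1>\d{1,2})/(?P<month1>\d{1,2}) -(?P<year1>\d{2})'
    r'|(?P<day2>\d{1,2})/(?P<month2>\d{1,2})'
    r'|(?P<year3>\d{4})-(?P<month3>\d{2})-(?P<day3>\d{2})'
)


def remap(match):
    """Copy e.g. day2 -> day for whichever numbered alternative matched."""
    groups = match.groupdict()
    for x in range(1, 4):
        if groups.get('day%d' % x) is not None:
            # Periods the alternative does not define (year2) are skipped.
            return {period: groups[period + str(x)]
                    for period in ('day', 'month', 'year')
                    if groups.get(period + str(x)) is not None}
    raise ValueError('no alternative matched: %r' % match.group(0))


dates = [remap(m) for m in EXPR.finditer('1/2 -13, 5/3 & 2012-11-22')]
```

With this in place, the if/elif chain reduces to a single call per match, at the cost of relying on the naming convention.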

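You can use a negative lookahead in your second regex so that it matches only those dates that are not followed by -year:

dformats = (
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2})(?!\d|\s+-(?P<year>\d{2}))',
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
)

That way, a date matched by the first regex will not also be matched by the second. The \d alternative in the lookahead keeps the month from backtracking to a single digit; without it, 22/11 -12 would still match the second regex as 22/1.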
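
A small self-contained check of the lookahead idea, as a sketch: DFORMATS and get_dates below are named for this example only. The lookahead in this sketch rejects a following digit as well as a following " -yy", so that a two-digit month cannot be shortened by backtracking to slip past the year check.

```python
import re

DFORMATS = (
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
    # d/m only when neither another digit nor " -yy" follows
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2})(?!\d|\s+-\d{2})',
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
)


def get_dates(datestring):
    """Run each pattern separately and collect the matched text."""
    found = []
    for regex in DFORMATS:
        for match in re.finditer(regex, datestring):
            found.append(match.group(0))
    return found


found = get_dates('22/11 -12, 5/3 & 2012-11-22')
```

Each date should show up exactly once in found, even though the patterns are still applied one after another as in the question.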

2012-11-22 21:08:39

You can use sub instead of find:

import re

def find_dates(s):

    dformats = (
        r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
        r'(?P<day>\d{1,2})/(?P<month>\d{1,2})',
        r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    )

    dates = []
    for f in dformats:
        # Record each match, then replace it with '' so that later
        # patterns cannot find the same date again.
        s = re.sub(f, lambda m: dates.append(m.groupdict()) or '', s)
    return dates
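
This loop works where the one in the question did not because the result of the substitution is assigned back with s = re.sub(...). Python strings are immutable, so str.replace() returns a new string and leaves the original alone; calling it without using the return value has no effect. The sketch below illustrates this with made-up variable names, and also shows that an iterator already created by finditer keeps scanning the string it was given.

```python
import re

text = '1/2 -13, 5/3'

# str.replace() returns a new string; the result is discarded here,
# so text is exactly what it was before.
text.replace('1/2 -13', '')
unchanged = text

# Keeping the return value is what actually "removes" the date.
stripped = text.replace('1/2 -13', '')

# finditer was handed the original string object, so rebinding the name
# inside the loop does not change which matches it yields.
pattern = re.compile(r'\d{1,2}/\d{1,2}')
s = '1/2 -13, 5/3'
matches = []
for m in pattern.finditer(s):
    s = s.replace(m.group(0), '')
    matches.append(m.group(0))
```

In the question's code the rebinding would still matter for the next pattern, since re.finditer is called again with the current value of the variable on each pass of the outer loop.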

2012-11-22 21:10:08 georg
