python：數字範圍到正則表達式匹配字符串

這可能是一個已經解決的問題，但我無法弄清楚。我有兩個較大的整數，我們稱它們爲start_number和end_number（它們代表連續的電話號碼塊）。其他數字（表示爲字符串）將輸入到我的系統中，我需要使用正則表達式將其與「範圍正則表達式」進行匹配，以查看數字字符串是否落在start_number和end_number之間或之間。python：數字範圍到正則表達式匹配字符串

例如：

start_number = 99519000
end_number = 99519099

因此

expression = "^995190[0-9][0-9]$"

讓我最終能夠符合下例數（這在我的系統一次一個到達，並可能在任何時候到達）：

"99519000" < - MATCH
"99519055" < - MATCH
"99519099" < - MATCH
"99519100" < - 不匹配
"99512210" < - 不匹配
"41234123" < - 不匹配

如何使用Python來創建正則表達式字符串模式「expression」給出任何合理start_number和end_number？我有幾個開始/結束數字'塊'我必須創建正則表達式模式，我只需要一種方法來編程這些模式。

這是公平的假設：

Start_number將始終小於end_number
將永遠是一個正整數。
在我的情況下，start_number和end_number將始終保持相同的「長度」（即當以字符串表示時總是具有相同數量的基本10'字符），如果它可以使生活更輕鬆。

編輯：爲清楚起見

來源

2013-07-24 TC Fox

爲什麼不匹配正確的數字位數（'\ d {8}'），然後對於在int中轉換的所有匹配項，檢查它是否在範圍內？ – Scharron

addint to @rnbcoder寫道你不能把str轉換爲int並比較？ 'int（start_number）<= test_number <=（end_number）' –

是否有任何理由必須用正則表達式來完成？我只是將該數字轉換爲一個int，並檢查它是否在start_number和end_number之間。這可能效率稍低，但複雜性的順序是相同的，因爲檢查正則表達式匹配是O（n），並且將字符串轉換爲int是O（n），那麼兩個int比較是O（1）。 –

[假設你需要這個，因爲它的一些奇怪的第三方系統，需要正則表達式]

新方法

我越去想弗雷德裏克評論，我越是同意。即使輸入字符串很長，正則表達式引擎也應該能夠將其編譯爲緊湊的DFA。許多情況下，下面是一個明智的解決辦法：

import re 

def regexp(lo, hi): 
    fmt = '%%0%dd' % len(str(hi)) 
    return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))

（它工作正常，在下文的測試中所有的數值範圍，包括99519000 - 99519099.粗糙背的最粗略計算表明，9這個數字大約是1GB內存的限制，這就是說，如果大多數數字的大小都匹配，如果只有少數匹配，則可以大得多）。

老方法

[再次更新給予更短的結果 - 從合併偶爾\d\d這是關於好，因爲產生的手分開]

假設所有的數字都是一樣的長度（即你零墊左側如果必要的話），這個工程：

import re 

def alt(*args): 
    '''format regexp alternatives''' 
    if len(args) == 1: return args[0] 
    else: return '(%s)' % '|'.join(args) 

def replace(s, c): 
    '''replace all characters in a string with a different character''' 
    return ''.join(map(lambda x: c, s)) 

def repeat(s, n): 
    '''format a regexp repeat''' 
    if n == 0: return '' 
    elif n == 1: return s 
    else: return '%s{%d}' % (s, n) 

def digits(lo, hi): 
    '''format a regexp digit range''' 
    if lo == 0 and hi == 9: return r'\d' 
    elif lo == hi: return str(lo) 
    else: return '[%d-%d]' % (lo, hi) 

def trace(f): 
    '''for debugging''' 
    def wrapped(lo, hi): 
     result = f(lo, hi) 
     print(lo, hi, result) 
     return result 
    return wrapped 

#@trace # uncomment to get calls traced to stdout (explains recursion when bug hunting) 
def regexp(lo, hi): 
    '''generate a regexp that matches integers from lo to hi only. 
     assumes that inputs are zero-padded to the length of hi (like phone numbers). 
     you probably want to surround with^and $ before using.''' 

    assert lo <= hi 
    assert lo >= 0 

    slo, shi = str(lo), str(hi) 
    # zero-pad to same length 
    while len(slo) < len(shi): slo = '0' + slo 
    # first digits and length 
    l, h, n = int(slo[0]), int(shi[0]), len(slo) 

    if l == h: 
     # extract common prefix 
     common = '' 
     while slo and slo[0] == shi[0]: 
      common += slo[0] 
      slo, shi = slo[1:], shi[1:] 
     if slo: return common + regexp(int(slo), int(shi)) 
     else: return common 

    else: 
     # the core of the routine. 
     # split into 'complete blocks' like 200-599 and 'edge cases' like 123-199 
     # and handle each separately. 

     # are these complete blocks? 
     xlo = slo[1:] == replace(slo[1:], '0') 
     xhi = shi[1:] == replace(shi[1:], '9') 

     # edges of possible complete blocks 
     mlo = int(slo[0] + replace(slo[1:], '9')) 
     mhi = int(shi[0] + replace(shi[1:], '0')) 

     if xlo: 
      if xhi: 
       # complete block on both sides 
       # this is where single digits are finally handled, too. 
       return digits(l, h) + repeat('\d', n-1) 
      else: 
       # complete block to mhi, plus extra on hi side 
       prefix = '' if l or h-1 else '0' 
       return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi)) 
     else: 
      prefix = '' if l else '0' 
      if xhi: 
       # complete block on hi side plus extra on lo 
       return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi)) 
      else: 
       # neither side complete, so add extra on both sides 
       # (and maybe a complete block in the middle, if room) 
       if mlo + 1 == mhi: 
        return alt(prefix + regexp(lo, mlo), regexp(mhi, hi)) 
       else: 
        return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi)) 


# test a bunch of different ranges 
for (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101), 
       (1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101), 
       (0, 123), (111, 123), (123, 222), (123, 333), (123, 444), 
       (0, 321), (111, 321), (222, 321), (321, 333), (321, 444), 
       (123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999), 
       (99519000, 99519099)]: 
    fmt = '%%0%dd' % len(str(hi)) 
    rx = regexp(lo, hi) 
    print('%4s - %-4s %s' % (fmt % lo, fmt % hi, rx)) 
    m = re.compile('^%s$' % rx) 
    for i in range(0, 1+int(replace(str(hi), '9'))): 
     if m.match(fmt % i): 
      assert lo <= i <= hi, i 
     else: 
      assert i < lo or i > hi, i

功能regexp(lo, hi)建立該0123893887620之間的匹配值的正則表達式和hi（零填充到最大長度）。您可能需要在之前放置一個^，然後在$之後（如在測試代碼中）強制匹配成爲整個字符串。

該算法實際上很簡單 - 它遞歸地將事物分成普通前綴和「完整塊」。一個完整的塊是類似於200-599並且可以可靠匹配（在這種情況下與[2-5]\d{2}）。

因此123-599分爲123-199和200-599。後半部分是一個完整的塊，前半部分具有共同的前綴1和23-99，它被遞歸地處理爲23-29（通用前綴）和30-99（完整塊）（並且我們最終終止，因爲參數到每個呼叫都比初始輸入短）。

唯一討厭的細節是prefix，這是必須的，因爲參數regexp()是整數，稱爲所以當產生，比方說，對於00-09的正則表達式，它實際上產生了0-9的正則表達式，而不領先0

輸出是一串的測試用例，示出了範圍和正則表達式：

0 - 0  0 
    0 - 1  [0-1] 
    0 - 2  [0-2] 
    0 - 9  \d 
    00 - 10 (0\d|10) 
    00 - 11 (0\d|1[0-1]) 
000 - 101 (0\d\d|10[0-1]) 
    1 - 1  1 
    1 - 2  [1-2] 
    1 - 9  [1-9] 
    01 - 10 (0[1-9]|10) 
    01 - 11 (0[1-9]|1[0-1]) 
001 - 101 (0(0[1-9]|[1-9]\d)|10[0-1]) 
000 - 123 (0\d\d|1([0-1]\d|2[0-3])) 
111 - 123 1(1[1-9]|2[0-3]) 
123 - 222 (1(2[3-9]|[3-9]\d)|2([0-1]\d|2[0-2])) 
123 - 333 (1(2[3-9]|[3-9]\d)|2\d\d|3([0-2]\d|3[0-3])) 
123 - 444 (1(2[3-9]|[3-9]\d)|[2-3]\d{2}|4([0-3]\d|4[0-4])) 
000 - 321 ([0-2]\d{2}|3([0-1]\d|2[0-1])) 
111 - 321 (1(1[1-9]|[2-9]\d)|2\d\d|3([0-1]\d|2[0-1])) 
222 - 321 (2(2[2-9]|[3-9]\d)|3([0-1]\d|2[0-1])) 
321 - 333 3(2[1-9]|3[0-3]) 
321 - 444 (3(2[1-9]|[3-9]\d)|4([0-3]\d|4[0-4])) 
123 - 321 (1(2[3-9]|[3-9]\d)|2\d\d|3([0-1]\d|2[0-1])) 
111 - 121 1(1[1-9]|2[0-1]) 
121 - 222 (1(2[1-9]|[3-9]\d)|2([0-1]\d|2[0-2])) 
1234 - 4321 (1(2(3[4-9]|[4-9]\d)|[3-9]\d{2})|[2-3]\d{3}|4([0-2]\d{2}|3([0-1]\d|2[0-1]))) 
000 - 999 \d\d{2} 
99519000 - 99519099 995190\d\d

它需要一段時間來運行作爲最後的測試遍歷99999999號。

表達式應該足夠緊湊，以避免任何緩衝區限制（我猜想內存大小最壞的情況是成正比的最大數字的位數的平方）。

ps我使用python 3，但我不認爲它在這裏有很大的不同。

來源

2013-07-24 16:51:09

作爲一個腳註，Python的正則表達式機器可以自動執行其中的一些操作，例如， 're.sre_parse.parse（「|」.join（str（i）for i in range（99519000,99519100）））'你會看到它如何提取共享前綴。我認爲編譯器不夠聰明，無法映射\ 0，但是。 – Fredrik

祝你好運調試 – mbatchkarov

有沒有你不明白的東西？我已經給代碼添加了評論，但是我很樂意解釋更多，如果有一點你沒有得到（你不需要調試正則表達式 - 這是你需要擔心的代碼）。 –

python：數字範圍到正則表達式匹配字符串

回答

相關問題