Regex - optional range

Using Python's re module, I'm trying to extract dollar values from phrases such as:

"$305,000 - $349,950" should give a tuple like this: (305000, 349950)
"Mid $2M's Buyers" -> (2000000)
"... Buyers Guide $1.29M+" -> (1290000)
"...$485,000 and $510,000" -> (485000, 510000)

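The pattern below works for single values, but when there is a range (as in the first and last examples above), it only gives me the last number (i.e. 349950 and 510000).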

_pattern = r"""(?x) 
    ^
    .* 
    (?P<target1> 
     [€$£] 
     \d{1,3} 
     [,.]? 
     \d{0,3} 
     (?:[,.]\d{3})* 
     (?P<multiplyer1>[kKmM]?\s?[mM]?) 
    ) 
    (?:\s(?:\-|\band\b|\bto\b)\s)? 
    (?P<target2> 
     [€$£] 
     \d{1,3} 
     [,.]? 
     \d{0,3} 
     (?:[,.]\d{3})* 
     (?P<multiplyer2>[kKmM]?\s?[mM]?) 
    )? 
    .*? 
    $ 
    """

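When I try target2 = match.group("target2").strip(), target2 always turns out to be None.

I'm by no means a regexpert, but I can't really see what I'm doing wrong here. The multiplier groups work, and to me the target2 group looks like the same pattern, just as an optional match at the end.

I hope this wording makes some sense...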

Source

2016-02-14 pandita

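+1 for writing the regex pattern in verbose mode.

The .* at the beginning of the pattern is greedy, so it tries to match the whole line. It then backtracks until target1 can match. Everything else in the pattern is optional, so matching target1 against the last amount on the line is already a successful match. You could try making the first .* non-greedy by adding a '?'. Like this: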

_pattern = r"""(?x)
    ^
    .*?     # <-- add the ?
    (?P<target1>
    ... snip ...
    """

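To make the greedy versus non-greedy behaviour concrete, here is a minimal sketch. It uses a deliberately simplified amount pattern (a dollar sign followed by digits and commas), which is an assumption for illustration and not the pattern from the question.

```python
import re

line = "...$485,000 and $510,000"

# Simplified, hypothetical amount pattern: "$" followed by digits and commas.
greedy = re.match(r".*(\$[\d,]+)", line)   # .* grabs as much as it can, then backtracks
lazy = re.match(r".*?(\$[\d,]+)", line)    # .*? grabs as little as it can

print(greedy.group(1))  # $510,000
print(lazy.group(1))    # $485,000
```

The greedy version ends up on the last amount in the line, the lazy version on the first one.
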
Could you do it incrementally?

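_pattern = re.compile(r"""(?x)
    (?P<target1>
     [€$£]
     \d{1,3}
     [,.]?
     \d{0,3}
     (?:[,.]\d{3})*
     (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?P<more>\s(?:\-|\band\b|\bto\b)\s)?
    """)

match = _pattern.search(line)
# the pattern has three groups, so pick the two we need by name
target1, more = match.group("target1", "more")
if more:
    # re.search() takes no start position; a compiled pattern's search() does
    target2 = _pattern.search(line, match.end())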

Edit: one more idea: try re.findall():

_pattern = r"""(?x) 
    (?P<target1> 
     [€$£] 
     \d{1,3} 
     [,.]? 
     \d{0,3} 
     (?:[,.]\d{3})* 
     (?P<multiplyer1>[kKmM]?\s?[mM]?) 
    ) 
""" 

targets = re.findall(_pattern, line)
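As a rough illustration of the findall idea: when a pattern contains groups, re.findall() returns tuples of the groups rather than the full matches, so the sketch below drops the named groups. It also reduces the multiplier to a single optional letter, which is a simplification assumed here and not part of the pattern above.

```python
import re

# Simplified variant of the amount pattern, without groups, so that
# findall() returns the matched text itself.
amount = re.compile(r"""(?x)
    [€$£]\d{1,3}[,.]?\d{0,3}(?:[,.]\d{3})*   # currency symbol and digits
    [kKmM]?                                   # optional K/M suffix (assumed single letter)
    """)

lines = [
    "$305,000 - $349,950",
    "Mid $2M's Buyers",
    "... Buyers Guide $1.29M+",
    "...$485,000 and $510,000",
]

results = [amount.findall(line) for line in lines]
for found in results:
    print(found)
```

Each line yields a list with one or two amount strings, which still need to be converted to numbers.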

Source

2016-02-14 04:12:59 RootTwo

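Unfortunately the first option didn't work. The result was the same, although it did report the first number in the range. I think your second suggestion is a good idea, but it didn't work either. By the way, the syntax appears to be 're.search(pattern, line)'. In any case, the "more" group always seems to be None... – pandita

Fixed the re.search() call – RootTwo

Could you use re.findall()? With a pattern for target1 only? It should return a list of all matches. – RootTwo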
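One likely reason the "more" group stays None: the optional \s? inside the multiplier group consumes the space that the separator needs, and because the separator is optional the match succeeds without it. The sketch below compares the pattern from the answer with a hypothetical variant that drops \s? from the multiplier; the variant is only an illustration and would no longer accept a space before the suffix.

```python
import re

line = "$305,000 - $349,950"

# Multiplier as in the answer: \s? can swallow the space before the dash.
with_space = re.compile(r"""(?x)
    (?P<target1>[€$£]\d{1,3}[,.]?\d{0,3}(?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?))
    (?P<more>\s(?:-|\band\b|\bto\b)\s)?
    """)

# Hypothetical variant: the same pattern without \s? in the multiplier.
without_space = re.compile(r"""(?x)
    (?P<target1>[€$£]\d{1,3}[,.]?\d{0,3}(?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?[mM]?))
    (?P<more>\s(?:-|\band\b|\bto\b)\s)?
    """)

m1 = with_space.search(line)
m2 = without_space.search(line)
print(repr(m1.group("target1")), repr(m1.group("more")))  # '$305,000 ' None
print(repr(m2.group("target1")), repr(m2.group("more")))  # '$305,000' ' - '
```

The same trailing space also ends up in target1 of the pattern in the question, which would explain why target2 stays None there even with a non-greedy .*? at the start.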

You could combine some regex logic with a function that converts the abbreviated numbers. Here is some example Python code:

# -*- coding: utf-8 -*-
import re
import locale
from locale import atof

# atof() needs a locale that treats "," as the thousands separator
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

string = """"$305,000 - $349,950"
"Mid $2M's Buyers"
"... Buyers Guide $1.29M+"
"...$485,000 and $510,000"
"""

def convert_number(number, unit):
    # scale the parsed number by its unit suffix (K = thousand, M = million)
    if unit == "K":
        exp = 10**3
    elif unit == "M":
        exp = 10**6
    return atof(number) * exp

matches = []
rx = r"""
    \$(?P<value>\d+[\d,.]*)         # match a dollar sign
                                    # followed by digits, dots and commas
                                    # at least one leading digit is required (+)
    (?P<unit>M|K)?                  # match M or K and save it to a group
    (                               # opening parenthesis
        \s(?:-|and)\s               # whitespace, then a dash or "and", then whitespace
        \$(?P<value1>\d+[\d,.]*)    # the same pattern as above
        (?P<unit1>M|K)?
    )?                              # closing parenthesis,
                                    # make the whole subpattern optional (?)
"""
for match in re.finditer(rx, string, re.VERBOSE):
    if match.group('unit') is not None:
        value1 = convert_number(match.group('value'), match.group('unit'))
    else:
        value1 = atof(match.group('value'))
    m = (value1)
    if match.group('value1') is not None:
        if match.group('unit1') is not None:
            value2 = convert_number(match.group('value1'), match.group('unit1'))
        else:
            value2 = atof(match.group('value1'))
        m = (value1, value2)
    matches.append(m)

print(matches)
# [(305000.0, 349950.0), 2000000.0, 1290000.0, (485000.0, 510000.0)]

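The code uses a fair amount of logic: it first imports the locale module for its atof() function, defines a function convert_number(), and then searches for ranges with a regular expression that is explained in the code comments. You could obviously add other currency symbols such as €$£, but they weren't in your original examples.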
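If the en_US.UTF-8 locale is not installed, setlocale() raises an error. A possible locale-free alternative for the conversion step is sketched below; the name to_number and the MULTIPLIERS table are hypothetical, and the sketch assumes that commas are always thousands separators.

```python
# Hypothetical lookup table for the unit suffixes.
MULTIPLIERS = {"K": 10**3, "M": 10**6}

def to_number(value, unit=None):
    """Convert a string such as '305,000' or '1.29' plus an optional unit to a float."""
    number = float(value.replace(",", ""))   # assumes "," is a thousands separator
    return number * MULTIPLIERS.get((unit or "").upper(), 1)

print(to_number("305,000"))   # 305000.0
print(to_number("2", "M"))    # 2000000.0
```

It takes the same value and unit strings that the regex groups above capture, so it could stand in for both convert_number() and the bare atof() calls.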

Source

2016-02-14 08:31:43 Jan

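I like this solution. It isn't a pure regex solution, but it's easier to understand – murphy

@murphy: In my opinion, it doesn't have to be regex alone :) – Jan

Yes. If I have to debug this code in 3 months, there's a good chance I'll understand it right away. Using an evil regex might be great for my ego now, but it would make me hate myself in 3 months :) – murphy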
