Python: using re.sub to replace multiple substrings multiple times

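I am trying to correct some text that contains very typical scanning errors (l mistaken for I and vice versa). Basically, I would like the replacement string in re.sub to depend on the number of times the "I" is detected, something like this: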

re.sub("(\w+)(I+)(\w*)", "\g<1>l+\g<3>", "I am stiII here.")

What is the best way to achieve this?
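For reference, a quick check of what the attempt above actually produces. The replacement string is taken literally, so "l+" is inserted as two characters instead of being read as "one l per I". The only change in this sketch is the use of raw strings; the variable name result is just for illustration.

```python
import re

# Same call as in the question, with raw strings so the backslashes reach re intact.
result = re.sub(r"(\w+)(I+)(\w*)", r"\g<1>l+\g<3>", "I am stiII here.")
print(result)  # I am stiIl+ here.
```

Note also that only one of the two I's ends up in group 2, because the greedy \w+ takes the first one.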

2012-03-28 Anne L.

Can you give examples of the other cases you have run into? – 2012-03-28 12:10:20

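Pass a function as the replacement instead of a string, as described in the docs. Your function can identify the error and build the best replacement based on it.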

import re

def replacement(match):
    if "I" in match.group(2):
        # One "l" for every "I" in the captured run
        return match.group(1) + "l" * len(match.group(2)) + match.group(3)
    # Add additional cases here and as ORs in your regex
    return match.group(0)  # leave anything unrecognised unchanged

print(re.sub(r"(\w+)(II+)(\w*)", replacement, "I am stiII here."))
# I am still here.

(Note that I modified your regex so that the whole repetition ends up in one group.)

2012-03-28 12:05:50 DNS

Is it necessary to modify the regex? '+' is greedy... – mgilson 2012-03-28 12:28:58

Yes; otherwise the first I would be swallowed by the \w+ in the first group. – DNS 2012-03-28 12:52:37
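The greedy behaviour described in this comment can be checked directly by looking at what each group captures. This is only a small sketch; the names word, original and modified are made up for the example.

```python
import re

word = "stiII"

# Question's pattern: the greedy \w+ keeps the first "I", so group 2 gets only one.
original = re.match(r"(\w+)(I+)(\w*)", word).groups()
print(original)  # ('stiI', 'I', '')

# Modified pattern: "II+" forces group 2 to hold at least two I's.
modified = re.match(r"(\w+)(II+)(\w*)", word).groups()
print(modified)  # ('sti', 'II', '')
```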

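Thanks, I did not realise this mechanism existed. It is very useful for catching special cases without complicating the regex. I posted my final code in an answer below. – 2012-03-29 06:03:49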

It seems to me that you could do this:

import re

def replace_L(match):
    # Swap every occurrence of the captured I's inside the matched word
    return match.group(0).replace(match.group(1), 'l' * len(match.group(1)))

string_I_want = re.sub(r'\w+(I+)\w*', replace_L, 'I am stiII here.')
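A note on why this works even though the pattern is unchanged: the group only captures a single "I", but str.replace then swaps every "I" in the matched word. The sketch below shows that, along with a side effect that may or may not be wanted (a leading capital I in the same word is replaced too). The sample word 'ItaIy' and the variable names are only for illustration.

```python
import re

def replace_L(match):
    # str.replace swaps every occurrence of the captured text in the whole match
    return match.group(0).replace(match.group(1), 'l' * len(match.group(1)))

pattern = r'\w+(I+)\w*'

captured = re.search(pattern, 'stiII').group(1)
print(captured)  # I

fixed = re.sub(pattern, replace_L, 'I am stiII here.')
print(fixed)  # I am still here.

side_effect = re.sub(pattern, replace_L, 'ItaIy')
print(side_effect)  # ltaly
```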

2012-03-28 12:14:45 mgilson

You can use a lookaround to replace only the Is that are followed or preceded by another I:

import re

print(re.sub(r"(?<=I)I|I(?=I)", "l", "I am stiII here."))
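To get a feel for where the lookaround approach stops, here is a small sketch that wraps it in a helper (fix_doubled_I is a made-up name) and runs it on a few inputs. The sample strings other than the one from the question are my own.

```python
import re

def fix_doubled_I(text):
    # Replace an I only when another I sits directly before or after it.
    return re.sub(r"(?<=I)I|I(?=I)", "l", text)

print(fix_doubled_I("I am stiII here."))  # I am still here.
print(fix_doubled_I("iIIegaI"))           # illegaI  (the lone trailing I is left alone)
print(fix_doubled_I("Henry VIII"))        # Henry Vlll  (Roman numerals are not protected)
```

So a standalone "I" is safe, but a single misread I inside a word is missed, and legitimate runs of capital I's get rewritten.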

2012-03-28 12:49:29 georg

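Based on the answer proposed by DNS, I built something a bit more complex to catch all cases (or at least most of them) while trying not to introduce too many new errors: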

import re

def Irepl(matchobj):
    # Catch acronyms: leave all-uppercase words untouched
    if matchobj.group(0).isupper():
        return matchobj.group(0)
    else:
        # Replace Group2 with 'l's
        return matchobj.group(1) + 'l' * len(matchobj.group(2)) + matchobj.group(3)


# Impossible to know if first letter is correct or not (possibly a name)
I_FOR_l_PATTERN = r"([a-zA-HJ-Z]+?)(I+)(\w*)"
for line in lines:  # lines: the text lines to correct, defined elsewhere
    # Fix a few common l/I mix-ups around apostrophes and the standalone word "I"
    tmp_line = line.replace("l'", "I'").replace("'I", "'l").replace(" l ", " I ")
    tmp_line = re.sub("^l ", "I ", tmp_line)

    cor_line = re.sub(I_FOR_l_PATTERN, Irepl, tmp_line)

    # Loop to catch all errors in a word (iIIegaI for example)
    while cor_line != tmp_line:
        tmp_line = cor_line
        cor_line = re.sub(I_FOR_l_PATTERN, Irepl, tmp_line)
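In case it is not obvious why the loop is needed: group 3 swallows the rest of the word, so one pass of re.sub only fixes the first run of I's in each word. The sketch below traces the passes for a single word. It is a simplified version with the acronym check left out, and fix_once and steps are names made up for the example.

```python
import re

PATTERN = r"([a-zA-HJ-Z]+?)(I+)(\w*)"

def fix_once(text):
    # One pass: only the first run of I's in each word is replaced,
    # because group 3 consumes the remainder of the word.
    return re.sub(
        PATTERN,
        lambda m: m.group(1) + "l" * len(m.group(2)) + m.group(3),
        text,
    )

steps = ["iIIegaI"]
while True:
    fixed = fix_once(steps[-1])
    if fixed == steps[-1]:
        break  # nothing changed, so the word is fully corrected
    steps.append(fixed)

print(steps)  # ['iIIegaI', 'illegaI', 'illegal']
```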

Hope this helps someone else!

2012-03-29 06:02:07
