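Regex for accent-insensitive replacement in Python

In Python 3, I would like to be able to use re.sub() in an "accent-insensitive" way, just as we can do case-insensitive replacement with the re.I flag.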

It could look something like a re.IGNOREACCENTS flag:

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë."
accent_regex = r'a café' 
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)

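This would give "¿It's 80°C, I'm drinking X in X with Chloë." (note that there is still an accent on "Chloë") instead of "¿It's 80°C, I'm drinking X in a cafe with Chloë.", which is what real Python produces.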

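I think no such flag exists. So what would be the best option for doing this? Using re.finditer and unidecode on both original_text and accent_regex, and then replacing by splitting the string? Or replacing every character in accent_regex with its accented variants, for instance: r'[cç][aàâ]f[éèêë]'?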
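The second option (expanding each character into a class of its accented variants) could be sketched roughly as follows. This is only an illustration, not an established recipe: the helper name accent_insensitive_pattern is hypothetical, it handles a literal phrase rather than an arbitrary regex, and it only knows about the accented characters that actually occur in the text being searched.

```python
import re
import unicodedata


def strip_accents(s):
    # Decompose (NFD), then drop combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def accent_insensitive_pattern(phrase, text):
    """Hypothetical helper: build a regex for a literal phrase in which each
    character also matches the accented variants of it found in `text`."""
    variants = {}
    for ch in set(unicodedata.normalize('NFC', text)):
        base = strip_accents(ch)
        if len(base) == 1 and base != ch:
            variants.setdefault(base, set()).add(ch)

    parts = []
    for ch in strip_accents(unicodedata.normalize('NFC', phrase)):
        chars = {ch} | variants.get(ch, set())
        if len(chars) == 1:
            parts.append(re.escape(ch))
        else:
            parts.append('[' + ''.join(re.escape(c) for c in sorted(chars)) + ']')
    return ''.join(parts)


# The text is normalized to NFC so that each accented letter is one character.
text = unicodedata.normalize(
    'NFC', "¿It's 80°C, I'm drinking a café in a cafe with Chloë.")
pattern = accent_insensitive_pattern('a café', text)
result = re.sub(pattern, 'X', text)
print(result)
# ¿It's 80°C, I'm drinking X in X with Chloë.
```

Because the text itself is never modified, the other accents (such as the one in "Chloë") survive untouched. The obvious limitation is that regex syntax in the phrase is escaped rather than interpreted.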

Source

2017-04-26 Antoine Dusséaux

'It could be something like...' @WiktorStribiżew – revo

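What you are looking for is an equivalence class, although I am not aware of any Python regex module that supports them. The syntax usually looks like '[[=a=]]' –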

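unidecode is often mentioned for removing accents in Python, but it also does more than that: it converts '°' to 'deg', which might not be the desired output.

unicodedata seems to have enough functionality to remove accents.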

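With any pattern

This method should work with any pattern and any text.

You can temporarily remove the accents from both the text and the regex pattern. The match information from re.finditer() (start and end indices) can then be used to modify the original, accented text.

Note that the matches must be processed in reverse order, so that a replacement does not shift the indices of the matches still to be processed.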

import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."

accented_pattern = r'a café|François Déporte'

def remove_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

# Compile the pattern with its accents removed.
pattern = re.compile(remove_accents(accented_pattern))

modified_text = original_text
# Search the accent-free text; its indices line up with the original text
# as long as the original uses precomposed (NFC) characters.
matches = list(pattern.finditer(remove_accents(original_text)))

# Replace from the last match to the first so that earlier indices stay valid.
for match in reversed(matches):
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
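The approach above relies on the accent-free text having exactly the same length as the original. A minimal sketch of a reusable wrapper that enforces this is shown below. The names strip_accents_keep_length and sub_ignoring_accents are hypothetical, the replacement is treated as a literal string (no backreferences), and the input is normalized to NFC first on the assumption that this is acceptable for the caller.

```python
import re
import unicodedata


def strip_accents_keep_length(text):
    """Strip combining marks one character at a time, so that the result
    always has the same length as the input."""
    out = []
    for ch in text:
        base = ''.join(c for c in unicodedata.normalize('NFD', ch)
                       if unicodedata.category(c) != 'Mn')
        # Keep the original character if stripping would not yield exactly
        # one character (e.g. a lone combining mark or a Hangul syllable).
        out.append(base if len(base) == 1 else ch)
    return ''.join(out)


def sub_ignoring_accents(pattern, repl, text, flags=0):
    """Hypothetical helper: replace accent-insensitive matches of `pattern`
    in `text` with the literal string `repl`, keeping all other accents."""
    # Compose first so that "e" + U+0301 becomes the single character "é".
    text = unicodedata.normalize('NFC', text)
    plain = strip_accents_keep_length(text)
    regex = re.compile(
        strip_accents_keep_length(unicodedata.normalize('NFC', pattern)), flags)

    pieces = []
    last = 0
    for m in regex.finditer(plain):
        # Indices found in the accent-free copy are valid in the original.
        pieces.append(text[last:m.start()])
        pieces.append(repl)
        last = m.end()
    pieces.append(text[last:])
    return ''.join(pieces)


print(sub_ignoring_accents(
    r'a café', 'X', "¿It's 80°C, I'm drinking a café in a cafe with Chloë."))
# ¿It's 80°C, I'm drinking X in X with Chloë.
```

Building the result from slices in a single left-to-right pass is an alternative to reversing the matches: the original text is never mutated, so no index ever goes stale.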

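If the pattern is a word or a set of words

You could:

- remove the accents from your pattern words and save them in a set for fast lookup
- look for every word in your text with \w+
- remove the accents from the word:
  - if it matches, replace it with X
  - if it does not match, leave the word unchanged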

import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe with Chloë."

def remove_accents(string):
    return unidecode(string)

accented_words = ['café', 'français']

# Accent-free versions of the words to replace, stored in a set for fast lookup.
words_to_remove = set(remove_accents(word) for word in accented_words)

def remove_words(matchobj):
    # Called by re.sub() for every word; returns the replacement text.
    word = matchobj.group(0)
    if remove_accents(word) in words_to_remove:
        return 'X'
    else:
        return word

print(re.sub(r'\w+', remove_words, original_text))
# I'm drinking a X in a X with Chloë.

Source

2017-04-26 13:19:06

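Thanks, this method is clever! How could I modify it to replace n-grams rather than single words? (I edited my question to take this option into account, for example replacing "François Déporte" in a text where only "Francois Deporte" appears.) –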

@AntoineDusséaux: No problem, the first method should work fine. –

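I initially thought it would work, but after a few tests, this method fails if the length of the unidecoded string differs from the length of the original string. For example, unidecode('°') is 'deg', so with original_text = "I'm drinking a 18C° café in a cafe with Chloë, François Déporte and Francois Deporte." the result is "I'm drinking a 18C° café in a Xith Chloë, FrXnd FrX". What would be another way to unidecode and keep the length unchanged? –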

You can use Unidecode:

$ pip install unidecode

In your program:

import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe."
unidecoded_text = unidecode(original_text)
regex = r'cafe'
re.sub(regex, 'X', unidecoded_text)

Source

2017-04-26 12:47:28 horcrux

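Thanks, but this won't help, because I want to keep the other accents of the original text. –

@AntoineDusséaux Right, I hadn't thought of that. The other answer seems to be correct. – horcrux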
