Regex for splitting - breaking a word into morphemes or affixes

I am trying to split a word into its prefixes and suffixes (that is, its morphemes or affixes) and get the pieces back as a list.

I tried using a regular expression with the re.findall function, as shown below:

>>> import re 
>>> affixes = ['meth','eth','ketone', 'di', 'chloro', 'yl', 'ol'] 
>>> word = 'dimethylamin0ethanol' 
>>> re.findall('|'.join(affixes), word) 

['di', 'meth', 'yl', 'eth', 'ol']

However, I need the parts that do not match any affix to be included as well. For the example above, the required output is:

['di', 'meth', 'yl', 'amin0', 'eth', 'an', 'ol']

Does anyone know how to get these unmatched parts into the list too?

Source

2016-12-06 chase

You can use re.split() and capture the "delimiters" by wrapping the pattern in a group:

In [1]: import re 

In [2]: affixes = ['meth', 'eth', 'ketone', 'di', 'chloro', 'yl', 'ol'] 

In [3]: word = 'dimethylamin0ethanol' 

In [4]: [match for match in re.split('(' + '|'.join(affixes) + ')', word) if match] 
Out[4]: ['di', 'meth', 'yl', 'amin0', 'eth', 'an', 'ol']

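The list comprehension here filters out the empty strings that re.split() produces where two affixes are adjacent or where an affix sits at the start or end of the word.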
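The same idea can be wrapped in a small function. This is only a sketch: the name `split_affixes` is hypothetical, and the two extra steps (escaping each affix with `re.escape` and trying longer affixes first) are not in the answer above. They are assumptions that would matter if an affix contained regex metacharacters or if one affix were a prefix of another.

```python
import re

def split_affixes(word, affixes):
    # Hypothetical helper, not part of the original answer.
    # Escape each affix so regex metacharacters are matched literally,
    # and put longer affixes first so a short affix cannot shadow a longer one.
    ordered = sorted(affixes, key=len, reverse=True)
    pattern = '(' + '|'.join(re.escape(a) for a in ordered) + ')'
    # The capturing group makes re.split keep the matched affixes in the result;
    # the comprehension drops the empty strings between adjacent matches.
    return [part for part in re.split(pattern, word) if part]

affixes = ['meth', 'eth', 'ketone', 'di', 'chloro', 'yl', 'ol']
print(split_affixes('dimethylamin0ethanol', affixes))
# ['di', 'meth', 'yl', 'amin0', 'eth', 'an', 'ol']
```

Because the regex alternation is tried left to right at each position, the order of the affixes in the pattern only matters when two of them can match at the same starting position.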

Source

2016-12-06 05:10:03 alecxe

import re

affixes = ['meth', 'eth', 'ketone', 'di', 'chloro', 'yl', 'ol']
word = 'dimethylamin0ethanol'

# found = ['di', 'meth', 'yl', 'eth', 'ol']
found = re.findall('|'.join(affixes), word)

# not_found = [('', 'di'), ('', 'meth'), ('', 'yl'), ('amin0', 'eth'), ('an', 'ol')]
not_found = re.findall(r'(.*?)(' + '|'.join(affixes) + ')', word)

# Take the first item of each tuple in not_found,
# but ONLY when it is not the empty string.
all_items = [prefix for prefix, _ in not_found if prefix != ""] + found

print(all_items)
# all_items = ['amin0', 'an', 'di', 'meth', 'yl', 'eth', 'ol']

Assumption: your final list does not need to be in any particular order.
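If the order does matter, one possible variation is to walk the matches from left to right with `re.finditer` and collect the text between them. This is a sketch rather than part of the answer above; the helper name `split_in_order` is hypothetical. Note also that a `(.*?)(...)` pattern only captures unmatched text that is followed by an affix, so any unmatched text at the very end of the word would be lost; the sketch handles that case explicitly.

```python
import re

def split_in_order(word, affixes):
    # Hypothetical helper: keeps affixes and unmatched pieces
    # in the order in which they appear in the word.
    pattern = re.compile('|'.join(re.escape(a) for a in affixes))
    parts = []
    last_end = 0
    for match in pattern.finditer(word):
        # Text between the previous match and this one matched no affix.
        if match.start() > last_end:
            parts.append(word[last_end:match.start()])
        parts.append(match.group())
        last_end = match.end()
    # Keep any unmatched text after the last affix.
    if last_end < len(word):
        parts.append(word[last_end:])
    return parts

affixes = ['meth', 'eth', 'ketone', 'di', 'chloro', 'yl', 'ol']
print(split_in_order('dimethylamin0ethanol', affixes))
# ['di', 'meth', 'yl', 'amin0', 'eth', 'an', 'ol']
```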

Source

2016-12-06 05:32:17
