Matching terms in text

My list of terms to match is as follows:

a 
abc 
a abc 
a a abc 
abc

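I want to match these terms in a text and replace them with names such as "term1, term2". When several terms match at the same place, I want the longest match to be treated as the correct one.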

Text: I have a and abc maybe abc again and also a a abc. 
Output: I have term1 and term2 maybe term2 again and also a term3.

So far I am using the code below, but it does not find the longest match:

for x in terms:
    if x in text:
        pass  # do blabla with the matched term

Source

2017-09-03 user1979556

You can use re.sub:

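import re

words = ["a",
         "abc",
         "a abc",
         "a a abc"
         ]

test_str = "I have a and abc maybe abc again and also a a abc."

# Replace the longest terms first so that shorter terms cannot split them.
for word in sorted(words, key=len, reverse=True):
    # The label number is the term's 1-based position in the original list.
    term = "term%i" % (words.index(word) + 1)
    # \b restricts matches to whole words; re.escape protects regex metacharacters.
    test_str = re.sub(r"\b%s\b" % re.escape(word), term, test_str)

print(test_str)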

This gives you the "expected" result (you made a mistake in your example):

Input: I have a and abc maybe abc again and also a a abc. 
Output: I have term1 and term2 maybe term2 again and also term4.
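To see why the loop above sorts the terms by length first, here is a small sketch that runs the same substitution in both orders. The helper name `replace_terms` is made up for this illustration and is not part of the answer's code.

```python
import re

words = ["a", "abc", "a abc", "a a abc"]
text = "I have a and abc maybe abc again and also a a abc."


def replace_terms(text, ordered_words):
    # Hypothetical helper: apply one whole-word substitution per term,
    # in exactly the order the terms are given.
    for word in ordered_words:
        label = "term%d" % (words.index(word) + 1)
        text = re.sub(r"\b%s\b" % re.escape(word), label, text)
    return text


shortest_first = replace_terms(text, sorted(words, key=len))
longest_first = replace_terms(text, sorted(words, key=len, reverse=True))

print(shortest_first)
print(longest_first)
```

With the shortest terms first, every standalone "a" is consumed before "a a abc" gets a chance to match, so the longer terms never match at all. With the longest terms first, "a a abc" is replaced as a single unit.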

Source

2017-09-03 09:18:29 abccd

Or use re.sub with a replacement function:

import re

text = 'I have a and abc maybe abc again and also a a abc'
words = ['a', 'abc', 'a abc', 'a a abc']
# Longest terms first, because the regex engine tries alternatives left to right.
regex = re.compile(r'\b' + r'\b|\b'.join(sorted(words, key=len, reverse=True)) + r'\b')


def replacer(m):
    # Called once per match; returns the replacement text for that match.
    print('replacing : %s' % m.group(0))
    return 'term%d' % (words.index(m.group(0)) + 1)

print(re.sub(regex, replacer, text))

Result:

replacing : a 
replacing : abc 
replacing : abc 
replacing : a a abc 
I have term1 and term2 maybe term2 again and also term4

Or use an anonymous function (lambda) as the replacement:

print(re.sub(regex, lambda m: 'term%d' % (words.index(m.group(0)) + 1), text))
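If the term list is long or may contain regex metacharacters, the same single-pass idea could be wrapped up roughly as follows. This is only a sketch: the function name `replace_longest` and the `labels` dictionary are assumptions made for illustration, not part of the answer above.

```python
import re


def replace_longest(text, words):
    # Map each term to its 1-based label once, instead of calling list.index per match.
    labels = {}
    for i, word in enumerate(words, start=1):
        labels.setdefault(word, "term%d" % i)  # first occurrence wins, like list.index
    # Put the longest alternatives first: the engine tries them left to right.
    ordered = sorted(labels, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" in a term from acting as regex syntax.
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(w) for w in ordered))
    return pattern.sub(lambda m: labels[m.group(0)], text)


result = replace_longest(
    "I have a and abc maybe abc again and also a a abc",
    ["a", "abc", "a abc", "a a abc"],
)
print(result)
```

Because the text is scanned only once, an inserted label such as "term1" is never re-examined, so it cannot be matched by another term by accident.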

Source

2017-09-03 10:53:44 voscausa
