If you need more flexibility when matching entry descriptions, you can combine nltk and re:
from nltk.stem import PorterStemmer
import re
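Suppose you have several different descriptions of the same event, e.g. a rewrite of the system. You can use nltk.stem to capture rewrite, rewriting, rewrites, singular and plural forms, etc.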
master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]
terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]
stemmer = PorterStemmer()
# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)
# add 'match anything after it' expression to each of the stemmed words
# join result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')
for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)
Output:
['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']
['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']
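One caveat with the patterns above: the stems go into the regex unescaped and unanchored, so 'Firewal.*' can also match in the middle of a longer word, and the trailing '.*' runs to the end of the sentence. Below is a minimal sketch of a tighter variant. It assumes the stems are already computed (they are hard-coded from the printed patterns so the snippet runs without nltk), and `build_pattern` is a hypothetical helper name, not part of nltk or re. Also note that, as far as I know, recent NLTK releases lowercase stems by default, so your printed patterns may come out in lowercase; matching is unaffected because of re.IGNORECASE.

```python
import re

def build_pattern(stems):
    """Build a regex matching the stems in order, each at the start of a word."""
    # \b anchors each stem to a word boundary, \w* absorbs the word's suffix,
    # and the lazy .*? skips whatever lies between consecutive stems.
    parts = [r'\b' + re.escape(stem) + r'\w*' for stem in stems]
    return '.*?'.join(parts)

# Stems copied from the printed patterns above (assumed, not recomputed here).
stemmed_terms = [['Intrus', 'Detect'], ['Approv', 'rewrit'], ['Firewal']]
patterns = [build_pattern(stems) for stems in stemmed_terms]

sentence = 'The CTO is about to approve a complete rewrite of the system'
match = re.search(patterns[1], sentence, flags=re.IGNORECASE)
print(match.group(0))
```

With this variant the match stops at the last stemmed word instead of swallowing the rest of the sentence, which may or may not be what you want if you need the trailing context.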
Edit:
To see which of the terms caused the match:
for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))
Output:
TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls
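If you would rather collect the matches than print them (for example before writing them to the db), one possible sketch is to group the sentences by term. The patterns here are copied from the printed output above so the snippet runs without nltk, and the `found` dictionary is just an illustrative name, not something the code above defines.

```python
import re
from collections import defaultdict

terms = ['Intrusion Detection', 'Approved rewrite', 'Firewall']
# Patterns copied from the printed output above (assumed, not recomputed here).
regex_patterns = ['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']
master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls',
]

# Compile once so each pattern is not re-parsed for every sentence.
compiled = [re.compile(p, flags=re.IGNORECASE) for p in regex_patterns]

# Map each term to the list of sentences it matched.
found = defaultdict(list)
for sentence in master_list:
    for term, pattern in zip(terms, compiled):
        if pattern.search(sentence):
            found[term].append(sentence)

for term, sentences in found.items():
    print('{0}: {1} match(es)'.format(term, len(sentences)))
```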
Can you provide some sample data? The question is a bit unclear – Daenyth 2011-05-24 00:38:22