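How do I make a title-case regex match prefixed titles?
I need to extract possible titles from a large block of text. For example, I want to match phrases such as "Joe Smith", "The Firm", or "United States of America". I now need to modify it to also match names that begin with some kind of title (for example, "Dr. Joe Smith"). Here is the regular expression I have: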
import re

NON_CAPPED_WORDS = (
    # Articles
    'the',
    'a',
    'an',
    # Prepositions
    'about',
    'after',
    'as',
    'at',
    'before',
    'by',
    'for',
    'from',
    'in',
    'into',
    'like',
    'of',
    'on',
    'to',
    'upon',
    'with',
    'without',
)
TITLES = (
    'Dr\.',
    'Mr\.',
    'Mrs\.',
    'Ms\.',
    'Gov\.',
    'Sen\.',
    'Rep\.',
)
# These are words that don't match the normal title case regex, but are still allowed
# in matches
IRREGULAR_WORDS = NON_CAPPED_WORDS + TITLES
non_capped_words_re = r'[\s:,]+|'.join(IRREGULAR_WORDS)
TITLE_RE = re.compile(r"""(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|{0})+\s*)""".format(non_capped_words_re))
It builds the following regular expression:
(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|the[\s:,]+|a[\s:,]+|an[\s:,]+|about[\s:,]+|after[\s:,]+|as[\s:,]+|at[\s:,]+|before[\s:,]+|by[\s:,]+|for[\s:,]+|from[\s:,]+|in[\s:,]+|into[\s:,]+|like[\s:,]+|of[\s:,]+|on[\s:,]+|to[\s:,]+|upon[\s:,]+|with[\s:,]+|without[\s:,]+|Dr\.[\s:,]+|Mr\.[\s:,]+|Mrs\.[\s:,]+|Ms\.[\s:,]+|Gov\.[\s:,]+|Sen\.[\s:,]+|Rep\.)+\s*)
This doesn't seem to work, though:
>>> whitelisting.TITLE_RE.findall('Dr. Joe Smith')
[('Dr', 'Dr'), ('Joe Smith', 'Smith')]
Can anyone with better regex skills help me sort out this mess of a regex?
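One direction that might help, offered as a sketch rather than a definitive fix: in the generated pattern the generic capitalized-word alternative comes first, so it consumes `Dr` before `Dr\.` is ever tried, and `str.join` leaves the last item (`Rep\.`) without the `[\s:,]+` suffix. The version below tries the irregular words first and applies the separator to the whole group. The names `SKETCH_TITLE_RE` and `find_titles`, the shortened word list, and the raw-string titles are assumptions made for illustration.

```python
import re

# Shortened list for illustration; the full tuple from the question works the same way.
NON_CAPPED_WORDS = ('the', 'a', 'an', 'of', 'in', 'for')
# Raw strings make the escaped period explicit.
TITLES = (r'Dr\.', r'Mr\.', r'Mrs\.', r'Ms\.', r'Gov\.', r'Sen\.', r'Rep\.')

# Titles and lowercase words are tried before the generic capitalized-word
# pattern, and the separator follows the whole group, so every item gets it.
irregular_re = '|'.join(TITLES + NON_CAPPED_WORDS)
SKETCH_TITLE_RE = re.compile(
    r'(?P<title>(?:(?:{0})[\s:,]+|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+)'.format(irregular_re)
)


def find_titles(text):
    """Return every candidate title in text, with surrounding whitespace stripped."""
    return [m.group('title').strip() for m in SKETCH_TITLE_RE.finditer(text)]
```

With this ordering, `find_titles('Dr. Joe Smith')` returns `['Dr. Joe Smith']`. Lowercase words such as `a` or `of` can still match on their own in running text, as in the original pattern, so some filtering of the results is probably still needed.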
FYI, the backslashes aren't escaping the periods in your `TITLES`, because the strings aren't raw literals – Cameron 2011-05-27 01:30:17
@Cameron: `r'\.' == '\.'` – 2011-05-27 01:37:31
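A quick way to check this, as a minimal sketch: Python leaves an unrecognized escape such as `\.` in place, so the non-raw literal holds the same two characters as the raw one. The literal is parsed at runtime here (via `ast.literal_eval`) only so that the invalid-escape warning emitted by newer Python versions can be silenced; the variable names are illustrative.

```python
import ast
import re
import warnings

with warnings.catch_warnings():
    # Newer Python versions warn about the unrecognized escape '\.'.
    warnings.simplefilter('ignore')
    non_raw = ast.literal_eval("'\\.'")  # parses the source text '\.'

raw = r'\.'

same_string = non_raw == raw                             # both are backslash + period
matches_period = re.fullmatch(non_raw, '.') is not None  # the period is escaped
matches_other = re.fullmatch(non_raw, 'x') is not None   # so it is not a wildcard
```

Under these assumptions `same_string` and `matches_period` come out true and `matches_other` false, although raw strings remain the clearer way to write regex fragments.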
@zerocrates is right, but @Cameron is correct to point out that I should be clearer. :-) – 2011-05-27 01:38:58