Capture all consecutive all-caps words with a regex in Python?

I want to match all consecutive all-caps words/phrases using a regex in Python. Given the following:

text = "The following words are ALL CAPS. The following word is in CAPS."

the code should return:

ALL CAPS, CAPS

I am currently using:

matches = re.findall('[A-Z\s]+', text, re.DOTALL)

But this returns:

['T', ' ', ' ', ' ', ' ALL CAPS', ' T', ' ', ' ', ' ', ' ', ' CAPS']

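I clearly don't want the punctuation or the 'T'. I only want to return consecutive words, or single words, that consist entirely of capital letters.

Thanks

Source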

2017-04-20 BHudson

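What do you expect when the words are not separated by a space, as in 'ABC.DEF'? –

Why are you using the 're.DOTALL' option, given that there is no dot in your pattern? –

It was just copied over from another command. It doesn't change the output. I'm very new to regex, so of course it wouldn't do anything. – BHudson

This one does the job: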

import re
text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
# Each word is all capitals with at most one lowercase letter; words are joined by whitespace.
matches = re.findall(r"(\b(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)", text)
print(matches)

Output:

['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']

Explanation:

(               : start group 1
  \b            : word boundary
  (?:           : start non-capture group
    [A-Z]+      : 1 or more capitals
    [a-z]?      : 0 or 1 lowercase letter
    [A-Z]*      : 0 or more capitals
    |           : OR
    [A-Z]*      : 0 or more capitals
    [a-z]?      : 0 or 1 lowercase letter
    [A-Z]+      : 1 or more capitals
  )             : end non-capture group
  \b            : word boundary
  (?:           : start non-capture group
    \s+         : 1 or more whitespace characters
    (?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+) : same as above
    \b          : word boundary
  )*            : repeat the non-capture group 0 or more times
)               : end group 1
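
For readability, the same pattern can be assembled from a named sub-pattern instead of writing the word alternative twice. This is only a sketch of the regex explained above; the names WORD, PHRASE and find_caps_phrases are made up for illustration, and the outer capture group is dropped so that findall returns the whole match directly.

```python
import re

# One "word": at least one capital and at most one lowercase letter.
# WORD and PHRASE are illustrative names, not part of the original answer.
WORD = r"(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)"

# A phrase is one such word followed by any number of
# whitespace-separated words of the same shape.
PHRASE = re.compile(r"\b{w}\b(?:\s+{w}\b)*".format(w=WORD))


def find_caps_phrases(text):
    """Return every run of consecutive (nearly) all-caps words in text."""
    return PHRASE.findall(text)


sample = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
print(find_caps_phrases(sample))
# ['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']
```

Note that a two-letter word such as 'Is' still matches, because it contains one capital and only one lowercase letter.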

Source

2017-04-20 15:20:02 Toto

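Thanks! Accepted the answer. This works perfectly. Any chance you could help add a flag that *allows* (but does not require) one lowercase character in the captured string? I think requiring a lowercase letter would be something like '(?=.*[a-z])', but I want to get phrases/words like 'ALL CaPS' or 'CaPS' as well as 'ALL CAPS', 'CAPS'. Thanks again – BHudson

@BHudson: Do the words have to start with a capital letter? What about 'The' at the beginning of the string? – Toto

They don't need to start with a capital letter, but only one character should be lowercase. I have a huge file with names in all caps, but with a lot of misspellings. I'm going to extract all the names and use fuzzy matching to correct them. Some mistakenly have one lowercase letter. So I need to match 'tHE' or 'THE', but not 'The' (I would probably have 'JOHN SMITh' or 'jOHN SMiTH'). Thanks – BHudson

Your regex relies on an explicit condition (a trailing space).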

matches = re.findall(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])",text)

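This captures repetitions of A to Z as long as the match does not end in a lowercase letter, a digit, or a non-letter character.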
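
A quick way to see how this pattern behaves is to run it on the sentence from the question and on one extra string. The second input is made up for illustration; as far as I can tell, the single optional `\s?` means that at most two words are joined into one match.

```python
import re

# The pattern from this answer, compiled once for reuse.
pattern = re.compile(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])")

text = "The following words are ALL CAPS. The following word is in CAPS."
print(pattern.findall(text))
# ['ALL CAPS', 'CAPS']

# Hypothetical input: a run of three words is split after the second word,
# because the pattern allows only one whitespace character per match.
print(pattern.findall("ONE TWO THREE"))
# ['ONE TWO', 'THREE']
```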

Source

2017-04-20 15:07:00 Dashadower

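OP said 'ALL CAPS' should be 1 match group, and you missed the 'A' case – Tezra

@Tezra Edited to match the requirement. – Dashadower

Assuming you want the match to start and end with a letter, and to contain only letters and whitespace: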

\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b

The |[A-Z] alternative captures a lone I or A.
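
A minimal sketch of using this pattern with re.findall follows. The name caps_run and the second sentence are made up for illustration; because the pattern contains one capture group, findall returns the text of that group.

```python
import re

# Starts and ends with a capital, with only capitals or whitespace in between,
# or a single capital letter on its own.
caps_run = re.compile(r"\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b")

text = "The following words are ALL CAPS. The following word is in CAPS."
print(caps_run.findall(text))
# ['ALL CAPS', 'CAPS']

# Hypothetical sentence showing the single-letter branch.
print(caps_run.findall("I saw A BIG DOG."))
# ['I', 'A BIG DOG']
```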

Source

2017-04-20 15:08:17 Tezra

Keeping your regex, you can use strip() and filter:

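import re
string = "The following words are ALL CAPS. The following word is in CAPS."
# Strip whitespace around each match, then drop the empty strings left by lone spaces.
result = list(filter(None, [x.strip() for x in re.findall(r"\b[A-Z\s]+\b", string)]))
# ['ALL CAPS', 'CAPS']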

Source

2017-04-20 15:28:14
