Python regex for repeated punctuation and symbols

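I need a regular expression that matches repeated (more than one) punctuation characters and symbols: basically any repeated non-alphanumeric, non-whitespace character, such as ..., ???, !!!, ###, @@@, +++ and so on. It must be the same character repeated, so a sequence like '!?@' should not match.

I tried [^\s\w]+, which covers all the !!!, ???, $$$ cases, but it gives me more than I want, because it also matches mixed sequences such as '!@'.

Can anyone help? Thanks.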

2013-02-01 user2017502

SO is for helping with code problems, not for writing code for you. Look at the documentation for 're' and give it a try.

Try this pattern:

([.\?#@+,<>%~`!$^&\(\):;])\1+

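\1 refers to the first captured group, which is the contents of the parentheses. The trailing + then requires one or more further copies of that same character.

You will need to extend the list of punctuation characters to suit your needs.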
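As a minimal sketch of how this pattern might be used (the helper name find_repeats and the sample string are made up for illustration), re.finditer() with m.group(0) returns each full run, while group 1 holds only the single repeated character:

```python
import re

# The pattern from the answer above: one listed punctuation character,
# captured as group 1, followed by one or more copies of that same character.
REPEAT = re.compile(r'([.\?#@+,<>%~`!$^&\(\):;])\1+')

def find_repeats(text):
    """Return every run of two or more identical listed characters."""
    return [m.group(0) for m in REPEAT.finditer(text)]

sample = "Wait... what??? No!? Yes!!! a+b ## done"
print(find_repeats(sample))
# ['...', '???', '!!!', '##']
```

Note that re.findall() with this pattern would return only the captured single characters, because the pattern contains exactly one group.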

2013-02-01 02:53:55

Not supported in Python, AFAIK. – nhahtdh

That is, '\p{P}' and '\p{S}' are not. The backreference part is.

@nhahtdh Updated the answer.

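Edit: @Firoze Lafeer posted an answer that does everything with a single regular expression. I will leave this one up in case anyone is interested in combining a regular expression with a filter function, but for this problem, using Firoze Lafeer's answer is simpler and faster.

The answer I wrote before I saw Firoze Lafeer's answer is below, unchanged.

A simple regular expression cannot do this. The classic pithy summary is "regular expressions can't count". This is discussed here: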

How to check that a string is a palindrome using regular expressions?

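For a Python solution, I suggest combining a regular expression with a little bit of Python code. The regular expression throws out everything that is not a run of some sort of punctuation, and then the Python code checks for and throws out false matches (matches that are punctuation but not all the same character).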

import re
import string

# Character class matching any ASCII punctuation character.
# re.escape() backslash-escapes the characters that are special inside
# a character class (such as '-', ']', '^' and '\'), so they are
# treated as literals.
_char_class_punct = "[" + re.escape(string.punctuation) + "]"

# Pattern: a punctuation character followed by one or more punctuation characters.
# Thus, a run of two or more punctuation characters.
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+')

def all_same(seq, basis_case=True):
    """Return True if every item in seq equals the first; basis_case if seq is empty."""
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text):
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)]


# Alternate version of find_all_punct_runs() using re.finditer().
# If both definitions are kept in one module, this later one replaces the one above.
def find_all_punct_runs(text):
    return (s for s in (m.group(0) for m in _pat_punct_run.finditer(text)) if all_same(s, False))

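I wrote all_same() the way I did so that it works just as well on an iterator as on a string. Python's built-in all() returns True for an empty sequence, which is not what we want for this particular use of all_same(), so I added an argument for the desired basis case and made it default to True to match the behavior of all().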
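To make the basis case concrete, here is a small standalone sketch of all_same() on its own; the inputs are made up for illustration:

```python
def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        # Empty input: the caller decides what that should mean.
        return basis_case
    return all(x == first for x in itr)

print(all_same("!!!"))             # True
print(all_same("!?@"))             # False
print(all_same(iter(["#", "#"])))  # True: works on a one-shot iterator too
print(all_same(""))                # True: the default mirrors all([])
print(all_same("", False))         # False: empty input is rejected
```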

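This does as much of the work as possible inside Python's built-in machinery (the regular expression engine or all()), so it should be quite fast. For large input text, you may want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(). I gave an example. That example also returns a generator expression rather than a list. You can always force it to make a list: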

lst = list(find_all_punct_runs(text))

2013-02-01 03:23:18 steveha

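'-', '[' (not sure about Python) and ']' are special inside a character class, and so is '^' at the start. – nhahtdh

Try using 're.escape(string.punctuation)' instead. That works. (Confirmed that it is correct: 'all(re.match('[%s]' % re.escape(string.punctuation), letter) for letter in string.punctuation) == True'.)

@ChrisMorgan: Wow, that's great. It is obvious what it is doing, and I don't need to worry about whether I got it right. – steveha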

Here is how I would do it:

>>> import re
>>> st='non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and'
>>> reg=r'(([.?#@+])\2{2,})'
>>> print([m.group(0) for m in re.finditer(reg,st)])

or

>>> print([g for g,l in re.findall(reg, st)])

Either one prints:

['...', '???', '###', '@@@', '+++']
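Two details of this pattern are worth spelling out. The class [.?#@+] does not list '!', which is why '!!!' is absent from the output above. Also, \2{2,} asks for the captured character plus at least two more copies, so it only matches runs of three or more, whereas the question asks for "more than one". A hedged sketch of the difference follows; the helper name runs, the sample string, and the added '!' in the class are assumptions for illustration:

```python
import re

st = "wait.. what??? ok!! ### done"

# Captured character followed by at least two more copies: 3+ in a row.
three_or_more = r'(([.?#@+!])\2{2,})'
# Captured character followed by at least one more copy: 2+ in a row.
two_or_more = r'(([.?#@+!])\2+)'

def runs(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

print(runs(three_or_more, st))  # ['???', '###']
print(runs(two_or_more, st))    # ['..', '???', '!!', '###']
```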

2013-02-01 03:33:25 dawg

I think you are looking for something like this:

[run for run, leadchar in re.findall(r'(([^\w\s])\2+)', yourstring)]

Example:

In : teststr = "4spaces then(*(@^#$&&&&(2((((99999****" 

In : [run for run, leadchar in re.findall(r'(([^\w\s])\2+)',teststr)] 
Out: ['&&&&', '((((', '****']

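This gives you a list of the runs, excluding the spaces in the string as well as mixed sequences such as '*(@^'. The inner group captures a single character that is neither a word character nor whitespace, \2+ requires that same character to repeat, and the outer group captures the whole run.

If this is not exactly what you want, you could edit your question with a sample string and the exact output you would like to see.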
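One detail that may matter, offered as a hedged aside: \w also matches the underscore, so [^\w\s] never matches '_' and a run such as '___' is skipped. If underscores should count as symbols, one possible variant adds '_' as an alternative inside the inner group. The helper name find_runs and the sample string below are assumptions for illustration:

```python
import re

# The pattern from the answer above.
ORIGINAL = re.compile(r'(([^\w\s])\2+)')
# Variant that also treats '_' as a symbol.
WITH_UNDERSCORE = re.compile(r'(([^\w\s]|_)\2+)')

def find_runs(pattern, text):
    """Return the full run (group 1) for every match."""
    return [m.group(1) for m in pattern.finditer(text)]

sample = "name___here!!! ok -- 1000 a_b"
print(find_runs(ORIGINAL, sample))         # ['!!!', '--']
print(find_runs(WITH_UNDERSCORE, sample))  # ['___', '!!!', '--']
```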

2013-02-01 03:46:38
