Python re.sub non-greedy substitution fails when the string contains a newline

I've run into a problem with regular expressions in Python (2.7.9).

I'm trying to strip out HTML <span> tags using a regular expression like this:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)

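(The regex reads like this: <span, then anything that is not a >, then a >, then a non-greedy match of anything, followed by </span>. It uses re.S (re.DOTALL) so that . also matches newline characters.)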

This seems to work unless there is a newline in the text. It looks as if re.S (DOTALL) does not apply inside a non-greedy match.

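Below is the test code. Remove the newline from text1 and the re.sub works. Put it back in, and the re.sub fails. Put the newline character outside the <span> tag, and the re.sub works again.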

#!/usr/bin/env python 
import re 
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>' 
print repr(text1) 
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S) 
print repr(text2)

For comparison, I wrote a Perl script to do the same thing; there the regex works as I expect.

#!/usr/bin/perl 
$text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'; 
print "$text1\n"; 
$text1 =~ s/<span[^>]*>(.*?)<\/span>/\1/s; 
print "$text1\n";

Any ideas?

Tested with Python 2.6.6 and Python 2.7.9.

2016-04-18 Andy Watkins

Another instance of the same problem: http://stackoverflow.com/questions/42581/python-re-sub-multiline-caret-match. This problem is very common. The answer is: read the [docs](https://docs.python.org/2/library/re.html#re.sub).

The fourth positional argument of re.sub is count, not flags.

re.sub(pattern, repl, string, count=0, flags=0)

You need to pass flags explicitly as a keyword argument:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S) 
                                                      ↑↑↑↑↑↑

Otherwise, re.S is interpreted as the replacement count (at most 16 replacements), not as the S (or DOTALL) flag:

>>> import re 
>>> re.S 
16 

>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>' 

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S) 
'<body id="aa">this is a <span color="red">test\n with newline</span></body>' 

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S) 
'<body id="aa">this is a test\n with newline</body>'
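
To see re.S acting as a count rather than as a flag, here is a minimal sketch. The 20-span input string is made up for illustration; it contains no newlines, so DOTALL plays no part and only the replacement limit shows. The warnings filter is there because some newer Python 3 releases warn when count is passed positionally.

```python
import re
import warnings

# Hypothetical input: 20 single-line spans, so DOTALL is irrelevant here.
text = ''.join('<span>%d</span>' % i for i in range(20))
pattern = r'<span[^>]*>(.*?)</span>'

with warnings.catch_warnings():
    # Newer Python 3 releases may warn about a positional count argument.
    warnings.simplefilter('ignore')
    # re.S passed positionally lands in the count slot, i.e. count=16.
    limited = re.sub(pattern, r'\1', text, re.S)

# Passed by keyword it is treated as a flag, so every span is replaced.
full = re.sub(pattern, r'\1', text, flags=re.S)

remaining_limited = limited.count('<span>')  # spans left after 16 replacements
remaining_full = full.count('<span>')        # spans left with no count limit
```

With the positional call, 4 of the 20 spans are left in place; with flags=re.S none are.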

2016-04-18 09:18:47 falsetru

Thanks @falsetru, that solved it. (Phew!) An interesting side note: the flags argument is not recognized in Python 2.6, so we are using 2.7.
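
If the flags keyword is not available, two alternatives avoid it altogether: compile the pattern with the flag, or switch DOTALL on inline with (?s). The sketch below reuses text1 from the question; the variable names are illustrative, and it has not been checked against Python 2.6 itself.

```python
import re

text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

# Option 1: compile with the flag. The compiled pattern's sub() has no
# flags parameter, so the flag cannot be mistaken for a count.
span_re = re.compile(r'<span[^>]*>(.*?)</span>', re.S)
result_compiled = span_re.sub(r'\1', text1)

# Option 2: put the inline flag (?s) at the start of the pattern,
# which enables DOTALL for the whole expression.
result_inline = re.sub(r'(?s)<span[^>]*>(.*?)</span>', r'\1', text1)
```

Both calls strip the span that contains the newline, giving the same result as the flags=re.S version above.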
