A simple regular expression question

I have two nearly ideal expressions: one gives me good output, and the other gives me wrong output.

data/holidays/photos-2012-2013/word-another-more-more-5443/"><span class="bold">word another</span> - word</a>  

regex = 'data/holidays/photos-2012-2013/.+?(\d{4})/"><span class="bold">(.+?)</span>(.+?)</a>'

The parts word-another-more-more, word another and word all vary across the input above. The expression above correctly prints a list of tuples like this: ('6642', 'word another', ' - word')

data/holidays/photos-2012-2013/word-another-more-more-5443/">word- another - <span class="bold">word another</span></a> 

regex1 = 'data/holidays/photos-2012-2013/.+?(\d{4})/">(.+?)<span class="bold">(.+?)</span></a>'

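This one prints some garbage code, even though the syntax used is identical. The output is also a list of tuples, but they are full of unwanted code.

Can you see what is wrong with the second regular expression?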

Source

2013-03-08 nutship

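Please don't try to parse HTML with regular expressions. Why not use an HTML parser? – 2013-03-08 22:20:48

If it's so simple, why do you need help? :-) – paxdiablo 2013-03-08 22:20:55

I agree with Martijn Pieters: parsing HTML with regular expressions is almost guaranteed to fail; you are much more likely to succeed if you can use an XML/HTML parser. Beyond that, as general advice, I'd say try running your input through a tester such as http://regexpal.com/ and see whether your regular expression works the way you think it does. – neilr8133 2013-03-08 22:23:02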

Works for me:

>>> import re 
>>> text = 'data/holidays/photos-2012-2013/word-another-more-more-5443/">word- another - <span class="bold">word another</span></a>' 
>>> re.findall(r'data/holidays/photos-2012-2013/.+?(\d{4})/">(.+?)<span class="bold">(.+?)</span></a>', text) 
[('5443', 'word- another - ', 'word another')]
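
That session tests a single link in isolation. One possible explanation for the garbage in the question is that the real page holds both link forms on the same line; this is only a guess, since the page itself is not shown. In that case the lazy (.+?) after /"> must consume at least one character, so at a link that starts with the span it runs on into the next link. The sketch below reproduces that on a made-up fragment; the names page, loose and strict, the <a href="..."> wrapper, and the slugs are all hypothetical.

```python
import re

# Hypothetical one-line fragment containing both link forms from the question.
page = (
    '<a href="data/holidays/photos-2012-2013/aaa-bbb-1111/">'
    '<span class="bold">first link</span> - tail</a> '
    '<a href="data/holidays/photos-2012-2013/ccc-ddd-2222/">'
    'lead - <span class="bold">second link</span></a>'
)

# The second expression from the question, unchanged apart from the raw-string prefix.
regex1 = (r'data/holidays/photos-2012-2013/.+?(\d{4})/">'
          r'(.+?)<span class="bold">(.+?)</span></a>')

# The match starts in the first link, and group 2 swallows markup
# up to the bold span of the second link.
loose = re.findall(regex1, page)

# A stricter variant (an assumption, not the asker's code): the groups
# may not cross a tag boundary, and the slug may not cross '/' or '"'.
regex1_strict = (r'data/holidays/photos-2012-2013/[^/"]+?(\d{4})/">'
                 r'([^<]+)<span class="bold">([^<]+)</span></a>')

strict = re.findall(regex1_strict, page)
print(loose)
print(strict)  # [('2222', 'lead - ', 'second link')]
```

Restricting each group to [^<]+ keeps a match inside one link, but it still breaks as soon as the markup changes, for example if attributes are reordered or tags are nested.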

Note: please do not parse HTML with regular expressions. BeautifulSoup exists for exactly this reason.
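
As a rough illustration of the parser approach, here is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup, so that it runs without installing anything. The class name LinkExtractor, the <a href="..."> wrapper around the sample lines, and the choice to take the last four digits of the href as the id are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (id, bold_text, other_text) for each matching <a> element."""

    PREFIX = "data/holidays/photos-2012-2013/"

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None   # data for the link being read, or None
        self._in_bold = False  # True while inside <span class="bold">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and self.PREFIX in href:
            m = re.search(r"(\d{4})/$", href)  # trailing four-digit id
            self._current = {"id": m.group(1) if m else None,
                             "bold": "", "other": ""}
        elif (tag == "span" and self._current is not None
              and attrs.get("class") == "bold"):
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_bold = False
        elif tag == "a" and self._current is not None:
            c = self._current
            self.rows.append((c["id"], c["bold"], c["other"]))
            self._current = None

    def handle_data(self, data):
        # Text inside a matching link goes to the bold or non-bold bucket.
        if self._current is not None:
            self._current["bold" if self._in_bold else "other"] += data


# Both sample links from the question, wrapped in an assumed <a href="...">.
html = (
    '<a href="data/holidays/photos-2012-2013/word-another-more-more-5443/">'
    '<span class="bold">word another</span> - word</a>\n'
    '<a href="data/holidays/photos-2012-2013/word-another-more-more-5443/">'
    'word- another - <span class="bold">word another</span></a>'
)

parser = LinkExtractor()
parser.feed(html)
parser.close()
rows = parser.rows
print(rows)
```

Because the parser tracks which element it is in, the same code handles both link forms, whether the bold span comes first or last.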

Source

2013-03-08 22:22:10 nneonneo
