Get all instances of a regular expression in Python

I am trying to get all instances of a regular expression in Python using the following

import re 

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>' 
match = re.findall(r'<a.*>(.*)</a>', s) 

for string in match: 
    print(string)

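to get the innerHTML of all the links, but I only get the last occurrence, 'Go to page 4'. I think it sees one big string with several matches of the regex, which are treated as overlapping and ignored. So how do I get a collection that matches

['Go to page 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']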

Source

2013-07-26 SteveC

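The immediate problem is that regexps are greedy, that is, they try to consume the longest string possible. So you are correct that it matches all the way up to the last </a> it can find. Change it to be non-greedy (.*?):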

match = re.findall(r'<a.*?>(.*?)</a>', s) 
                         ^    ^
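To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened string. The `sample` value is a made-up stand-in for the question's HTML, not the original input.

```python
import re

# Shortened, hypothetical sample with two links.
sample = '<a href="1.html">one</a>, <a href="2.html">two</a>'

# Greedy: the first .* runs on to the last '>' that still lets the rest match,
# so the single match swallows both links and captures only the final text.
greedy = re.findall(r'<a.*>(.*)</a>', sample)

# Non-greedy: each .*? stops at the first place the rest of the pattern can match.
lazy = re.findall(r'<a.*?>(.*?)</a>', sample)

print(greedy)  # ['two']
print(lazy)    # ['one', 'two']
```

The matches are not overlapping and then dropped; the greedy version simply produces one long match covering the whole string.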

However, this is a terrible way to parse HTML: it is not robust and will break on the smallest change.

Here is a better way to do it:
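As a rough illustration of how small a change is enough, the sketch below feeds the non-greedy pattern a few variations of perfectly ordinary HTML. The inputs and the helper name `link_texts_regex` are invented for this example.

```python
import re


def link_texts_regex(html):
    """Hypothetical helper: the non-greedy regex approach from above."""
    return re.findall(r'<a.*?>(.*?)</a>', html)


# 1. Link text wrapped onto two lines: '.' does not match '\n' by default.
multiline = link_texts_regex('<a href="p1.html">Go to\npage 1</a>')

# 2. Upper-case tags: the pattern is case-sensitive.
uppercase = link_texts_regex('<A HREF="p1.html">Go to page 1</A>')

# 3. Another tag starting with "a": <abbr> is mistaken for the start of a link.
lookalike = link_texts_regex('<abbr>HTML</abbr> <a href="p1.html">Go to page 1</a>')

# 4. Markup inside the link: the result is raw HTML, not the visible text.
nested = link_texts_regex('<a href="p1.html"><b>Go</b> to page 1</a>')

print(multiline)  # []
print(uppercase)  # []
print(lookalike)  # ['HTML</abbr> <a href="p1.html">Go to page 1']
print(nested)     # ['<b>Go</b> to page 1']
```

Flags such as re.DOTALL and re.IGNORECASE patch the first two cases, but each patch only covers one more special case.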

from bs4 import BeautifulSoup 

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>' 
soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
print([el.string for el in soup('a')])
# ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']

You can then use its power to get the href as well as the text, e.g.:

print([[el.string, el['href']] for el in soup('a', href=True)])
# [['Go to 1', 'page1.html'], ['Go to page 2', 'page2.html'], ['Go to page 3', 'page3.html'], ['Go to page 4', 'page4.html']]

Source

2013-07-26 22:38:05

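Thanks! I didn't really understand ? in regular expressions, so this was a good learning experience. Here is what worked for me: match = re.findall(r'<a.*?>(.*?)</a>', s) – SteveC

@user1450120 I didn't see the other .* :) Anyway, expect this to break later or possibly return wrong results... Please look at using 'beautifulsoup' to parse HTML; it is easy to learn and flexible –

What kind of input could cause this to break? – SteveC

I would suggest using lxml: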

from lxml import etree 

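s = '<div><a href="page1.html">Go to page 1</a></div>'  # any HTML string
tree = etree.HTML(s)  # HTML parser; etree.fromstring() expects well-formed XML
for ele in tree.iter('*'):
    print(ele.tag, ele.text)  # do something with each element

It provides an iterparse function for processing large files, and it also accepts file-like objects such as urllib2.request objects. I have been using it for a long time to parse HTML and XML.

See: http://lxml.de/tutorial.html#the-element-class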

Source

2013-07-26 22:45:51 Mai

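I would avoid parsing HTML with regular expressions at ALL costs. Check out this article and this SO post for the reasons. But to sum it up...

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp

Instead, I would take a look at a Python HTML parsing package such as BeautifulSoup or pyquery. They provide nice interfaces for traversing, retrieving, and editing HTML.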
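If installing a package is not an option, the same idea can be sketched with the standard library's html.parser module. This is a stand-in for the packages named above, not their API; the class and function names are made up for the example, and it assumes links are not nested inside each other.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished link texts
        self._parts = None   # text pieces of the <a> being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <A> and <a> both arrive as 'a'.
        if tag == 'a':
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            self.links.append(''.join(self._parts))
            self._parts = None


def link_texts(html):
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.links


s = ('<div><a href="page1.html" title="page1">Go to 1</a>, '
     '<A HREF="page2.html"><b>Go</b> to\npage 2</A></div>')
print(link_texts(s))  # ['Go to 1', 'Go to\npage 2']
```

Upper-case tags, nested markup and line breaks inside the link are all handled by the parser rather than by extra regex flags.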

Source

2013-07-26 22:48:33 FastTurtle
