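Using a regular expression to extract all HTML attrs

I want to use the re module to extract all the HTML nodes from a string, including all of their attrs. However, I want each attr to be its own group, so that I can retrieve them with matchobj.group(). The number of attrs in a node is variable, and this is where I am stuck: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?>, but for a node like <a href='aaa' style='bbb'> I only get two groups: [('a'), ('style="bbb"')].
I know there are some good HTML parsers, but I am not actually going to extract the values of the attrs. I need to modify the raw string.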

2013-06-28 zhangyangyu

FFS ... http://www.crummy.com/software/BeautifulSoup/ –

Consider using an HTML parser instead of a regex. http://www.crummy.com/software/BeautifulSoup/ – Achrome

That is expected: the first match of the repeated group is overwritten by the second match. –
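
To illustrate the comment above, here is a minimal sketch that runs the pattern from the question against the sample tag. The variable names are only for illustration. A capturing group placed inside a repeated construct keeps only the text of its last iteration, which is why the href attribute disappears from the result.

```python
import re

# The pattern from the question: group 2 sits inside a repeated (...)* construct.
pattern = re.compile(r"</?(\w+)(\s\w+[^>]*?)*/?>")
match = pattern.match("<a href='aaa' style='bbb'>")

# Group 2 matched once per attribute, but only its final iteration is kept,
# so the href attribute is no longer available through the match object.
print(match.groups())
```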

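Description

Capturing an unlimited number of attributes takes two steps. First, pull out each whole element. Then iterate over the elements and collect the set of matching attributes from each one.

Regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>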


Regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
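
Since the question asks for each attribute to be reachable as a group, one possible variation is to add named groups to the attribute regex. The following is only a sketch: the names ELEMENT_RE, ATTR_RE, extract_attrs, and the group names name and value are made up for this example and are not part of the answer above.

```python
import re

# Same element pattern as above: one complete opening tag.
ELEMENT_RE = re.compile(
    r'''<\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>''')

# Same attribute pattern, with hypothetical named groups for the name and the raw value.
ATTR_RE = re.compile(
    r'''\s(?P<name>\w+)=(?P<value>'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)''')


def extract_attrs(html):
    """Return a list of (element_text, [(name, raw_value), ...]) tuples."""
    result = []
    for element in ELEMENT_RE.finditer(html):
        # Search for attributes inside this element's text only.
        attrs = [(m.group('name'), m.group('value'))
                 for m in ATTR_RE.finditer(element.group(0))]
        result.append((element.group(0), attrs))
    return result


for element_text, attrs in extract_attrs("<a href='aaa' style=\"bbb\" class=c>text</a>"):
    print(element_text)
    for name, value in attrs:
        print("   ", name, "->", value)
```

The raw values keep their surrounding quotes, so they can be put back into the original string unchanged.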


Python Example

See a working example: http://repl.it/J0t/4

Code

import re

string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""

# Step 1: find each complete opening tag in the input.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))

    # Step 2: find each attribute inside the current element only, not the whole input.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))

Output

------- 
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie> 
matchAttributesObj.group(0) : href="i.like.kittens.com" 
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"' 
matchAttributesObj.group(0) : class=Fonzie
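
Because the stated goal in the question is to modify the original string, the same two regexes can also drive re.sub with a replacement function. This is a rough sketch under that assumption. The helper lowercase_attr_names and the group names space, name, and value are hypothetical, and lowercasing attribute names is just a stand-in for whatever edit is actually needed.

```python
import re

# One complete opening tag (same pattern as in the answer).
ELEMENT_RE = re.compile(
    r'''<\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>''')

# One attribute, split into hypothetical named groups so each part can be rebuilt.
ATTR_RE = re.compile(
    r'''(?P<space>\s)(?P<name>\w+)=(?P<value>'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)''')


def lowercase_attr_names(html):
    """Lowercase attribute names inside opening tags; leave all other text as is."""
    def fix_attr(m):
        # Rebuild the attribute with only the name changed.
        return m.group('space') + m.group('name').lower() + '=' + m.group('value')

    def fix_element(element_match):
        # Apply the attribute rewrite within this element's text only.
        return ATTR_RE.sub(fix_attr, element_match.group(0))

    return ELEMENT_RE.sub(fix_element, html)


print(lowercase_attr_names('<a HREF="x" Class=Fonzie>Text HREF="y"</a>'))
```

Text outside of opening tags is never passed to the attribute regex, so anything that merely looks like an attribute in the page text is left untouched.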


2013-06-28 03:02:17

Please don't use regex. Use BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb

Or, if you want a dictionary:

>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}


2013-06-28 01:56:20 TerryA

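I know HTML parsers are usually the better choice, but I don't think they will work for me here. I need to modify the raw string. – zhangyangyu

@zhangyangyu Maybe take a look at [this](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#replace-with) – TerryA

Could the downvoter please clarify why they downvoted? – TerryA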
