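Find all tags with a specific attribute value
How do I iterate over all tags that have a specific value for a specific attribute? For example, suppose we only need data1, data2, and so on.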
<html>
<body>
<invalid html here/>
<dont care> ... </dont care>
<invalid html here too/>
<interesting attrib1="naah, it is not this"> ... </interesting tag>
<interesting attrib1="yes, this is what we want">
<group>
<line>
data
</line>
</group>
<group>
<line>
data1
<line>
</group>
<group>
<line>
data2
<line>
</group>
</interesting>
</body>
</html>
I tried BeautifulSoup, but it could not parse the document. lxml's parser seems to work:
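from io import StringIO
from lxml import etree

# get_sanitized_data and SITE come from my own code (not shown here)
broken_html = get_sanitized_data(SITE)
parser = etree.HTMLParser()  # lenient parser that recovers from malformed markup
tree = etree.parse(StringIO(broken_html), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)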
I'm not familiar with its API, and I can't figure out how to use getiterator or XPath for this.
Have you tried changing the MIME type to XML? Some parsers are picky... – JKirchartz 2010-09-23 12:57:11
Using XPath with lxml seems easy enough, give the docs a chance :) http://codespeak.net/lxml/xpathxslt.html – 2010-09-23 13:04:57
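Following up on the XPath suggestion, here is a minimal sketch of how the lookup might be written. It assumes the real page has the same shape as the sample in the question; BROKEN_HTML is a trimmed copy of that sample, and the names WANTED, line_texts_xpath and line_texts_iter are made up for illustration. The first function uses an XPath attribute predicate; the second does the same walk with iter(), which as far as I know is the current spelling of getiterator.

```python
from io import StringIO
from lxml import etree

# Trimmed copy of the broken markup from the question (assumption: the real
# page looks like this, including the unclosed <line> tags).
BROKEN_HTML = """<html>
<body>
<interesting attrib1="naah, it is not this"> ... </interesting tag>
<interesting attrib1="yes, this is what we want">
<group>
<line>
data
</line>
</group>
<group>
<line>
data1
<line>
</group>
<group>
<line>
data2
<line>
</group>
</interesting>
</body>
</html>"""

WANTED = "yes, this is what we want"  # attribute value to match


def line_texts_xpath(tree, wanted):
    """Collect non-empty <line> texts under the matching <interesting> tag."""
    # $wanted is an XPath variable, so the value needs no manual quoting.
    lines = tree.xpath("//interesting[@attrib1=$wanted]//line", wanted=wanted)
    texts = [(line.text or "").strip() for line in lines]
    # The unclosed <line> tags parse as nested, empty elements; skip them.
    return [text for text in texts if text]


def line_texts_iter(tree, wanted):
    """Same result, walking the tree with iter() instead of XPath."""
    found = []
    for element in tree.iter("interesting"):
        if element.get("attrib1") != wanted:
            continue
        for line in element.iter("line"):
            text = (line.text or "").strip()
            if text:
                found.append(text)
    return found


tree = etree.parse(StringIO(BROKEN_HTML), etree.HTMLParser())
print(line_texts_xpath(tree, WANTED))
print(line_texts_iter(tree, WANTED))
```

Both functions also pick up the text of the first group (data), so filter that out afterwards if only data1, data2 and so on are wanted.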