2010-09-23 255 views
2

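Find all tags with a specific attribute value

How do I iterate over all tags that have a specific attribute set to a specific value? For example, suppose we only need data1, data2, and so on.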

<html> 
    <body> 
     <invalid html here/> 
     <dont care> ... </dont care> 
     <invalid html here too/> 
     <interesting attrib1="naah, it is not this"> ... </interesting tag> 
     <interesting attrib1="yes, this is what we want"> 
      <group> 
       <line> 
        data 
       </line> 
      </group> 
      <group> 
       <line> 
        data1 
       <line> 
      </group> 
      <group> 
       <line> 
        data2 
       <line> 
      </group> 
     </interesting> 
    </body> 
</html> 

I tried BeautifulSoup, but it could not parse the file. lxml's parser, however, seems to work:

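from io import StringIO

from lxml import etree

# get_sanitized_data and SITE are my own helper and constant (not shown)
broken_html = get_sanitized_data(SITE) 

# HTMLParser recovers from malformed markup instead of raising
parser = etree.HTMLParser() 
tree = etree.parse(StringIO(broken_html), parser) 

result = etree.tostring(tree.getroot(), pretty_print=True, method="html") 

print(result) 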

I'm not familiar with its API, and I can't figure out how to use getiterator or XPath for this.

+0

Have you tried changing the MIME type to XML? Some parsers are picky... – JKirchartz 2010-09-23 12:57:11

+2

Using XPath with lxml looks easy enough; give the docs a chance :) http://codespeak.net/lxml/xpathxslt.html – 2010-09-23 13:04:57

Answers

3

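Here is one way to do it, using lxml and the XPath expression 'descendant::*[@attrib1="yes, this is what we want"]'. The XPath tells lxml to look at all descendants of the current node and return the elements whose attrib1 attribute equals "yes, this is what we want".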

import io

import lxml.html as lh

content = '''
<html> 
    <body> 
     <invalid html here/> 
     <dont care> ... </dont care> 
     <invalid html here too/> 
     <interesting attrib1="naah, it is not this"> ... </interesting tag> 
     <interesting attrib1="yes, this is what we want"> 
      <group> 
       <line> 
        data 
       </line> 
      </group> 
      <group> 
       <line> 
        data1 
       <line> 
      </group> 
      <group> 
       <line> 
        data2 
       <line> 
      </group> 
     </interesting> 
    </body> 
</html> 
'''
# io.StringIO replaces the Python 2-only cStringIO module
doc = lh.parse(io.StringIO(content))
# Select every descendant element whose attrib1 has exactly this value
tags = doc.xpath('descendant::*[@attrib1="yes, this is what we want"]')
print(tags)
# [<Element interesting at b767e14c>]
for tag in tags:
    # encoding="unicode" returns str rather than bytes on Python 3
    print(lh.tostring(tag, encoding="unicode"))
# <interesting attrib1="yes, this is what we want"><group><line> 
#      data 
#     </line></group><group><line> 
#      data1 
#     <line></line></line></group><group><line> 
#      data2 
#     <line></line></line></group></interesting> 
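
Since the question only needs the text inside the line tags, the same idea can be pushed one step further. The following is a minimal sketch, not part of the original answer. It assumes a simplified, well-formed stand-in for the page (the real, malformed page may parse into a slightly different tree, for example with extra empty line elements, which is why empty strings are filtered out). The names WANTED, line_texts_xpath and line_texts_iter are hypothetical helpers introduced only for illustration.

```python
import lxml.html as lh

# Simplified, well-formed stand-in for the page in the question (an assumption).
content = '''<html><body>
<interesting attrib1="naah, it is not this"><group><line>skip me</line></group></interesting>
<interesting attrib1="yes, this is what we want">
  <group><line> data1 </line></group>
  <group><line> data2 </line></group>
</interesting>
</body></html>'''

root = lh.fromstring(content)

WANTED = "yes, this is what we want"


def line_texts_xpath(root):
    """Collect the stripped text of every <line> below a matching element."""
    # $value is an XPath variable; lxml fills it in from the keyword argument.
    lines = root.xpath('//*[@attrib1=$value]//line', value=WANTED)
    texts = [(line.text or '').strip() for line in lines]
    # Drop empty strings, which a recovering parser may produce for stray tags.
    return [text for text in texts if text]


def line_texts_iter(root):
    """Same result without XPath; iter() supersedes the deprecated getiterator()."""
    texts = []
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(elem.tag, str):
            continue
        if elem.get('attrib1') != WANTED:
            continue
        for line in elem.iter('line'):
            text = (line.text or '').strip()
            if text:
                texts.append(text)
    return texts


print(line_texts_xpath(root))
print(line_texts_iter(root))
```

Both helpers should give the same list for this sample. The XPath version is shorter; the iter() version may be easier to extend if the matching rule is more than a simple equality test.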
+0

Thanks, you saved my day! – 2010-09-23 16:14:23
