2011-10-06 22 views
0

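BeautifulSoup/lxml.html: remove a tag and its children if a child looks like x

I'm having trouble finding the right solution. If <answer> = 99, I want to remove the <question> and its children, so I need a string containing only the filtered questions. I have the following HTML structure: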

<html> 
<body>   
    <questionaire> 
    <question> 
    <questiontext> 
    Do I have a question? 
    </questiontext> 
    <answer> 
    99 
    </answer> 
    </question> 
    <question> 
    <questiontext> 
    Do I love HTML/XML parsing? 
    </questiontext> 
    <questalter> 
    <choice> 
     1 oh god yeah 
    </choice> 
    <choice> 
     2 that makes me feel good 
    </choice> 
    <choice> 
     3 oh hmm noo 
    </choice> 
    <choice> 
     4 totally 
    </choice> 
    </questalter> 
    <answer> 
     4 
    </answer> 
    </question> 
    <question> 
    </questionaire> 
</body> 
</html>  

So far I have tried to do this with XPath... but lxml.html has no iterparse... or does it? Thanks!

Answers

1

This will do exactly what you need:

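from xml.dom import minidom

# `text` holds the markup as a string; minidom requires well-formed XML.
doc = minidom.parseString(text)
for question in doc.getElementsByTagName('question'):
    for answer in question.getElementsByTagName('answer'):
        # Compare the stripped text of the first child node of <answer>.
        if answer.childNodes[0].nodeValue.strip() == '99':
            question.parentNode.removeChild(question)
            break  # the question is gone; skip any remaining answers

print(doc.toxml())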

Result:

<html> 
<body>   
    <questionaire> 

    <question> 
    <questiontext> 
    Do I love HTML/XML parsing? 
    </questiontext> 
    <questalter> 
    <choice> 
     1 oh god yeah 
    </choice> 
    <choice> 
     2 that makes me feel good 
    </choice> 
    <choice> 
     3 oh hmm noo 
    </choice> 
    <choice> 
     4 totally 
    </choice> 
    </questalter> 
    <answer> 
     4 
    </answer> 
    </question> 
    </questionaire> 
</body> 
</html> 
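
The sample markup in the question ends with a stray, unclosed <question> tag, and a strict XML parser such as minidom rejects input like that. The following is a minimal sketch of that behaviour, assuming a shortened, hypothetical version of the markup; the names `well_formed`, `broken`, and `parses` are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical, shortened version of the markup from the question.
well_formed = (
    "<questionaire><question><answer> 99 </answer></question></questionaire>"
)
# Same markup with a stray, unclosed <question> before the closing tag.
broken = well_formed.replace("</questionaire>", "<question></questionaire>")


def parses(markup):
    """Return True if minidom accepts the markup, False on a parse error."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


print(parses(well_formed))
print(parses(broken))
```

If the input cannot be cleaned up beforehand, a more forgiving HTML parser is probably a better fit than minidom.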
+0

Hi Matt, thanks for your answer... this looks quite complicated... I wonder whether there is a BeautifulSoup or lxml solution...? – Jurudocs

+0

I updated my answer so that it works with your HTML. Be warned that you have a stray '' at the end, which causes a parse error. –

+0

Thank you very much... I find minidom too awful, but this looks good! Personally I prefer lxml... I wish I could accept both answers ;-) – Jurudocs

1
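from lxml import etree

# `html_string` holds the markup; etree.fromstring() requires well-formed XML.
html = etree.fromstring(html_string)
questions = html.xpath('/html/body/questionaire/question')
for question in questions:
    for element in question:  # iterate over the direct children
        # element.text can be None, so fall back to an empty string.
        if element.tag == 'answer' and (element.text or '').strip() == '99':
            question.getparent().remove(question)
            break  # the question is already removed
print(etree.tostring(html, encoding='unicode'))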
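
For comparison, here is a rough sketch of the same filter using only the standard library's `xml.etree.ElementTree`, whose API resembles lxml's. It assumes the input is well-formed; the `sample` string is a shortened, hypothetical version of the markup from the question, and `drop_questions` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; the input is assumed to be well-formed.
sample = """<html><body><questionaire>
<question><questiontext>Do I have a question?</questiontext><answer> 99 </answer></question>
<question><questiontext>Do I love HTML/XML parsing?</questiontext><answer> 4 </answer></question>
</questionaire></body></html>"""


def drop_questions(markup, unwanted="99"):
    """Remove every <question> whose direct <answer> child equals `unwanted`."""
    root = ET.fromstring(markup)
    # ElementTree elements do not know their parent, so visit each
    # potential parent and look at its direct <question> children.
    for parent in list(root.iter()):
        # Copy the child list because we remove from it while looping.
        for question in list(parent.findall("question")):
            answer = question.findtext("answer", default="")
            if answer.strip() == unwanted:
                parent.remove(question)
    return ET.tostring(root, encoding="unicode")


filtered = drop_questions(sample)
print(filtered)
```

Comparing the stripped text exactly, rather than testing for a substring, avoids accidentally dropping questions whose answer merely contains 99, such as 199.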