BeautifulSoup/lxml.html: remove a tag and its children if a child looks like x

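I'm having trouble finding the right solution. I want to remove a <question> and all of its children whenever its <answer> equals 99. The result should be a string containing only the remaining questions. I have the following HTML structure: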

<html> 
<body>   
    <questionaire> 
    <question> 
    <questiontext> 
    Do I have a question? 
    </questiontext> 
    <answer> 
    99 
    </answer> 
    </question> 
    <question> 
    <questiontext> 
    Do I love HTML/XML parsing? 
    </questiontext> 
    <questalter> 
    <choice> 
     1 oh god yeah 
    </choice> 
    <choice> 
     2 that makes me feel good 
    </choice> 
    <choice> 
     3 oh hmm noo 
    </choice> 
    <choice> 
     4 totally 
    </choice> 
    </questalter> 
    <answer> 
     4 
    </answer> 
    </question> 
    <question> 
    </questionaire> 
</body> 
</html>

So far I have tried to do this with XPath... but lxml.html has no iterparse... or does it? Thanks!


2011-10-06 Jurudocs

This should do exactly what you need:

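from xml.dom import minidom

# `text` is assumed to hold the markup shown in the question
doc = minidom.parseString(text)
for question in doc.getElementsByTagName('question'):
    for answer in question.getElementsByTagName('answer'):
        # Compare the stripped text of <answer> with '99'
        if answer.childNodes[0].nodeValue.strip() == '99':
            question.parentNode.removeChild(question)
            break  # the question is gone; stop checking its answers

print(doc.toxml())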

Result:

<html> 
<body>   
    <questionaire> 

    <question> 
    <questiontext> 
    Do I love HTML/XML parsing? 
    </questiontext> 
    <questalter> 
    <choice> 
     1 oh god yeah 
    </choice> 
    <choice> 
     2 that makes me feel good 
    </choice> 
    <choice> 
     3 oh hmm noo 
    </choice> 
    <choice> 
     4 totally 
    </choice> 
    </questalter> 
    <answer> 
     4 
    </answer> 
    </question> 
    </questionaire> 
</body> 
</html>


2011-10-06 20:19:48

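Hi Matt, thanks for your answer... this looks rather complicated... I wonder whether there is a BeautifulSoup or lxml solution...? – Jurudocs

I updated my answer so that it works with your HTML. Be warned that you have a stray <question> at the end, which causes a parse error. –

Thank you very much... I find minidom pretty awful, but this looks good! Personally I prefer lxml... I wish I could accept both answers ;-) – Jurudocs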
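The parse error mentioned in the comments can be reproduced with a shortened version of the markup. This is only a minimal sketch: the `broken` string and the `parses_cleanly` helper are made up for illustration and are not part of the original answer.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Shortened, hypothetical version of the question's markup: the last
# <question> is opened but never closed.
broken = (
    "<questionaire>"
    "<question><answer>99</answer></question>"
    "<question>"
    "</questionaire>"
)
# Same markup with the stray opening tag removed.
fixed = broken.replace("<question></questionaire>", "</questionaire>")


def parses_cleanly(markup):
    """Return True if minidom accepts the markup as well-formed XML."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


print(parses_cleanly(broken))
print(parses_cleanly(fixed))
```

Dropping the unclosed <question> before parsing is enough to let a strict XML parser such as minidom read the document.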

from lxml import etree

# `html_string` is assumed to hold the markup shown in the question
html = etree.fromstring(html_string)
for question in html.xpath('/html/body/questionaire/question'):
    for element in question:  # direct children of <question>
        # Exact match, so that e.g. '199' is not removed by accident
        if element.tag == 'answer' and (element.text or '').strip() == '99':
            question.getparent().remove(question)
            break
print(etree.tostring(html, encoding='unicode'))
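If lxml is not installed, roughly the same approach can be sketched with the standard library's xml.etree.ElementTree, which has a similar API but no getparent(). The function name `drop_questions`, its `unwanted` parameter, and the shortened `sample` markup are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET


def drop_questions(markup, unwanted="99"):
    """Return the markup without any <question> whose <answer> equals `unwanted`."""
    root = ET.fromstring(markup)
    # ElementTree elements have no parent pointer, so visit every element
    # as a potential parent and inspect its direct <question> children.
    for parent in list(root.iter()):
        for question in parent.findall("question"):
            answers = question.findall("answer")
            if any((a.text or "").strip() == unwanted for a in answers):
                parent.remove(question)
    return ET.tostring(root, encoding="unicode")


# Shortened, well-formed version of the markup from the question.
sample = (
    "<questionaire>"
    "<question><questiontext>Do I have a question?</questiontext>"
    "<answer> 99 </answer></question>"
    "<question><questiontext>Do I love HTML/XML parsing?</questiontext>"
    "<answer> 4 </answer></question>"
    "</questionaire>"
)
result = drop_questions(sample)
print(result)
```

Like the answers above, this expects well-formed input, so the stray <question> from the original markup would have to be removed first.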


2011-10-06 20:25:37
