2016-07-29

lxml removes unwrapped text inside a tag

Here is my Python code:

from lxml import etree

some_xml_data = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
root = etree.fromstring(some_xml_data)
[c] = root.xpath('//span')
print(etree.tostring(root))  # b'<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>' (as expected)

# But if I make some changes:
for e in c.iterchildren("*"):
    if e.tag == 'div':
        e.getparent().remove(e)

print(etree.tostring(root))  # b'<span>text1</span>': text2 and text3 are gone! How can I prevent this?

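It looks like after I make some changes to the lxml tree (removing some tags), lxml also removes the unwrapped text. How can I prevent lxml from doing this and keep the unwrapped text?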

Answers


The text that follows a node is called its tail. It can be preserved by appending it to the parent's text. Here is a sample:

In [1]: from lxml import html

In [2]: s = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"

In [3]: tree = html.fromstring(s)

In [4]: for node in tree.iterchildren("div"):
   ...:     parent = node.getparent()
   ...:     if node.tail:
   ...:         # keep the text that follows the node by moving it into the parent's text
   ...:         parent.text = (parent.text or "") + node.tail
   ...:     parent.remove(node)
   ...:

In [5]: html.tostring(tree)
Out[5]: b'<span>text1text2text3</span>'
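To see why the tail has to be copied before the node is removed, it may help to look at where lxml keeps each piece of text. A minimal sketch, reusing the markup from the question (the variable name tails is just for illustration):

```python
from lxml import etree

root = etree.fromstring("<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>")

# Text before the first child is stored on the parent itself.
print(root.text)  # text1

# Text after each child is stored on that child as its tail,
# so removing the child takes the tail along with it.
tails = [child.tail for child in root]
print(tails)  # ['text2', 'text3']
```

So text2 and text3 are not really "unwrapped" from lxml's point of view: each one belongs to the div in front of it.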

I used html because the markup looks more like HTML than XML. You can simply pass "div" to iterchildren to avoid the extra check on the tag.
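Appending the tail to the parent's text only gives the right order when nothing is kept in front of the removed node. Here is a rough sketch of a more general version, under the assumption that the text should stay exactly where it was: the helper remove_keep_tail is a hypothetical name of mine, not part of lxml, and the extra b element in the sample markup is added only to show the sibling case. The last lines use lxml's own etree.strip_elements with with_tail=False, which should do the same job without a helper.

```python
from lxml import etree


def remove_keep_tail(node):
    """Remove node from its parent but keep the text that follows it.

    Hypothetical helper for illustration; not an lxml API.
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if node.tail:
        previous = node.getprevious()
        if previous is not None:
            # Text after an earlier sibling lives in that sibling's tail.
            previous.tail = (previous.tail or "") + node.tail
        else:
            # No earlier sibling: the text becomes part of the parent's text.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)


root = etree.fromstring(
    "<span><b>keep</b>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
)
# Take a snapshot of the matches so the tree is not modified while iterating.
for div in list(root.iterchildren("div")):
    remove_keep_tail(div)
result = etree.tostring(root)
print(result)  # b'<span><b>keep</b>text1text2text3</span>'

# Built-in alternative: drop the elements but leave their tail text in place.
root2 = etree.fromstring("<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>")
etree.strip_elements(root2, "div", with_tail=False)
stripped = etree.tostring(root2)
print(stripped)  # b'<span>text1text2text3</span>'
```

If you are already working with lxml.html, the drop_tree() method on HTML elements is another option: it removes the element and joins its tail to the previous sibling or the parent.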