2017-02-17 48 views

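lxml in Python: reading text and tree from an XML file with a given structure

I am trying to get the text and id under each node; for an example file, see: example.xml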

However, the file is not structured like a regular XML file. The structure is as follows:

<TextWithNodes><Node id="0"/> 
<Node id="1"/> 
<Node id="2"/>9407011<Node id="9"/> 
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/> 
<Node id="19"/> Lg.Pr.Dc <Node id="29"/> 

I want the output to be a list of (start_node, end_node, text_between_node). I am not sure whether I can do this with the lxml library.

Currently, I use

from lxml import etree 
tree = etree.parse('9407011.az-scixml.xml') 
nodes = list(tree.xpath('//TextWithNodes')[0])  # child elements (getchildren() is deprecated)
node = nodes[0]  # take one node as an example
print(node.text)  # prints None: <Node/> is an empty element, so it has no text of its own

Please show what you have tried with the posted sample, or link to it, along with your desired results. – Parfait

Answers


XPath may work for you. Comparing normalize-space() with the empty string filters out nodes that are not followed by any non-whitespace text.

Something like this:

from lxml import etree as ET 
root = ET.XML(b'''<?xml version='1.0' encoding='UTF-8'?> 
<GateDocument version="3"> 
<TextWithNodes><Node id="0"/> 
<Node id="1"/> 
<Node id="2"/>9407011<Node id="9"/> 
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/> 
<Node id="19"/> Lg.Pr.Dc <Node id="29"/> 
</TextWithNodes></GateDocument>''') 

# Select each 'Node' element, but only if:
#   - it has an 'id' attribute,
#   - its first following sibling is a text node that is not all whitespace, and
#   - its second following sibling is a 'Node' with an 'id'.
for r in root.xpath('''//Node[@id] 
          [following-sibling::node() 
           [1] 
           [self::text()] 
           [normalize-space() != ""]] 
          [following-sibling::node() 
           [2] 
           [self::Node[@id]]]'''): 
    # Every element matched by the XPath above has non-blank tail text
    # and a following 'Node' sibling, so the calls below are safe.
    print(r.get('id'), repr(r.tail), r.getnext().get('id'))
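
If you want the result as a list of (start_node, end_node, text_between_node) tuples, as the question asks, the same idea can probably be sketched without XPath by walking the elements and reading each one's .tail. This is only a minimal sketch under a few assumptions: text_spans is a hypothetical helper name, the text is stripped with str.strip() instead of being filtered with normalize-space(), and the 'id' attribute is not checked, so behaviour may differ slightly from the XPath version on other files.

```python
from lxml import etree

XML = b'''<GateDocument version="3">
<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>
</TextWithNodes></GateDocument>'''


def text_spans(root):
    """Return (start_id, end_id, text) for each Node followed by non-blank text.

    Hypothetical helper, not part of lxml.
    """
    spans = []
    for node in root.iter('Node'):
        nxt = node.getnext()   # next sibling element, or None
        text = node.tail       # text between this element and the next one
        # Skip nodes that have no following Node or only whitespace after them.
        if nxt is None or nxt.tag != 'Node' or not text or not text.strip():
            continue
        spans.append((node.get('id'), nxt.get('id'), text.strip()))
    return spans


spans = text_spans(etree.fromstring(XML))
for span in spans:
    print(span)
```

On the sample above this should give one tuple per text run, for example ('2', '9', '9407011'). Drop the .strip() in the append call if you need the surrounding whitespace preserved so that the node ids still line up with character offsets.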

This works like a charm, thank you Rob! – titipata