Python lxml: reading text and tree structure from an XML file with a given structure

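I am trying to get the text and the ids of the nodes in a file (for example, a file located at example.xml).

However, it is not structured like a regular XML file. The structure is as follows: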

<TextWithNodes><Node id="0"/> 
<Node id="1"/> 
<Node id="2"/>9407011<Node id="9"/> 
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/> 
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>

I want the output to be a list of start_node, end_node, and text_between_node. I am not sure whether I can do this with the lxml library.

Currently, I use:

from lxml import etree

tree = etree.parse('9407011.az-scixml.xml')
nodes = list(tree.xpath('//TextWithNodes')[0])  # child <Node/> elements
node = nodes[0]  # take one node as an example
print(node.text)  # prints None: <Node/> is self-closing, so it encloses no text
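For context, the reason `node.text` comes back empty is that the text in this format sits after each self-closing `<Node/>`, and both lxml and the standard library's ElementTree expose such trailing text as `.tail` rather than `.text`. A minimal sketch, assuming a shortened inline sample instead of the real file (the fallback import is only there so the snippet also runs without lxml installed):

```python
try:
    from lxml import etree
except ImportError:
    # The standard library shares the .text/.tail API used here.
    import xml.etree.ElementTree as etree

# Shortened sample: two empty elements with text between them.
root = etree.fromstring(
    '<TextWithNodes><Node id="2"/>9407011<Node id="9"/></TextWithNodes>'
)

first = root[0]
print(first.text)  # None: nothing is enclosed by <Node id="2"/>
print(first.tail)  # 9407011: the text that follows the element
```

So the text between two nodes can be read from the `.tail` of the first one.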

Asked 2017-02-17 by titipata

Please show what you have tried, with a posted sample or a link, along with your expected result. – Parfait

XPath may work for you. Comparing normalize-space() with the empty string eliminates the nodes that are not followed by any non-whitespace text.

Something like this might work for you:

from lxml import etree as ET 
root = ET.XML(b'''<?xml version='1.0' encoding='UTF-8'?> 
<GateDocument version="3"> 
<TextWithNodes><Node id="0"/> 
<Node id="1"/> 
<Node id="2"/>9407011<Node id="9"/> 
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/> 
<Node id="19"/> Lg.Pr.Dc <Node id="29"/> 
</TextWithNodes></GateDocument>''') 

# Grab each 'Node' element:
# only if the element has an 'id' attribute, and only if
# the first following sibling is a text node that isn't
# all whitespace, and only if
# the second following sibling is a 'Node' with an 'id'
for r in root.xpath('''//Node[@id] 
          [following-sibling::node() 
           [1] 
           [self::text()] 
           [normalize-space() != ""]] 
          [following-sibling::node() 
           [2] 
           [self::Node[@id]]]'''): 
    # All elements that satisfy that above XPath should 
    # also satisfy the requirements for the next line 
    print(r.get('id'), repr(r.tail), r.getnext().get('id'))
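If a list of (start_node, end_node, text_between_node) tuples is wanted rather than printed lines, the same idea can also be sketched in plain Python by pairing each node with its next sibling and reading `.tail`. This is a variation, not the XPath approach above: `text_spans` and `SAMPLE` are hypothetical names, and the sketch assumes `TextWithNodes` contains only `Node` elements and text. The fallback import is only so that it also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # The subset of the API used here also exists in the standard library.
    import xml.etree.ElementTree as etree

SAMPLE = """<GateDocument version="3">
<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>
</TextWithNodes></GateDocument>"""


def text_spans(container):
    """Return (start_id, end_id, text) for each non-blank text run between nodes."""
    nodes = [child for child in container if child.tag == "Node"]
    spans = []
    # Pair every node with the node that follows it.
    for start, end in zip(nodes, nodes[1:]):
        text = start.tail  # text between `start` and `end`
        if text and text.strip():  # skip missing or whitespace-only text
            spans.append((start.get("id"), end.get("id"), text))
    return spans


root = etree.fromstring(SAMPLE)
spans = text_spans(root.find("TextWithNodes"))
for span in spans:
    print(span)
# ('2', '9', '9407011')
# ('10', '13', 'ACL')
# ('14', '18', '1994')
# ('19', '29', ' Lg.Pr.Dc ')
```

The text is kept unstripped on purpose, because the ids in this sample look like character offsets (19 to 29 spans the ten characters of ' Lg.Pr.Dc '); call `.strip()` on it if only the token is needed.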

Answered 2017-02-17 19:24:40

This works like a charm, thank you Rob! – titipata
