Python lxml iterparse with nested elements

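I want to retrieve the content of specific elements within an XML file. However, these elements contain other XML elements, which prevents the content inside the parent tag from being extracted correctly. An example: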

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
    print(element.text)

This results in:

a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and; 
None

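However, the text 'A protective uniform for use ..', for example, is missed. It seems that every 'claim-text' element that contains other inner elements is ignored. How should I change the parsing of the XML to get all claims?

Thanks

I just solved the problem with a "plain" SAX-style parser approach: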

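class SimpleXMLHandler(object):
    """Parser target that prints the full text of each outermost <claim-text>."""

    def __init__(self):
        self.buffer = ''
        self.claim = 0  # nesting depth of currently open <claim-text> elements

    def start(self, tag, attributes):
        if tag == 'claim-text':
            if self.claim == 0:
                self.buffer = ''  # a new outermost claim begins
            self.claim += 1

    def data(self, data):
        # Collect text only while inside a <claim-text> element
        if self.claim > 0:
            self.buffer += data

    def end(self, tag):
        if tag == 'claim-text':
            self.claim -= 1
            if self.claim == 0:
                print(self.buffer)  # print once the outermost claim closes

    def close(self):
        pass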
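
A handler of this shape can be passed to lxml as a parser target. The following is a minimal sketch of that wiring, not the exact code from the post: the class name `ClaimCollector`, its attributes, and the shortened sample XML are assumptions made for illustration. It tracks nesting depth so that each outermost `claim-text` is returned once, with the text of its nested elements included.

```python
from lxml import etree


class ClaimCollector(object):
    """Hypothetical parser target: one string per outermost <claim-text>."""

    def __init__(self):
        self.claims = []  # finished claims
        self.parts = []   # text chunks of the claim being read
        self.depth = 0    # nesting depth of open <claim-text> elements

    def start(self, tag, attrib):
        if tag == 'claim-text':
            self.depth += 1

    def data(self, data):
        # Text may arrive in several chunks, so collect and join later
        if self.depth > 0:
            self.parts.append(data)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:
                self.claims.append(''.join(self.parts))
                self.parts = []

    def close(self):
        # The value returned here becomes the result of the parse call
        return self.claims


# Shortened sample document (an assumption, not the original data)
xml = (b"<?xml version='1.0' ?><test><claim-text><b>2</b>. A uniform comprising: "
       b"<claim-text>a. an upper garment</claim-text> "
       b"<claim-text>b. panels</claim-text></claim-text></test>")

parser = etree.XMLParser(target=ClaimCollector())
claims = etree.fromstring(xml, parser)
print(claims)
```

Returning the list from `close()` instead of printing keeps the handler reusable, since the caller decides what to do with the collected claims.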

2011-04-20 labrassbandito

You can use XPath to find and concatenate all text nodes that are direct children of each <claim-text> node, like this:

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml), events=('start',), tag='claim-text')
# Note: on a 'start' event the element's content is not guaranteed to be parsed yet;
# this works here because the small document is read in a single chunk.
for event, element in context:
    print(''.join(element.xpath('text()')))

which outputs:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: 
a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
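
Note that `text()` selects only the text nodes directly under each `claim-text`, so the "2" inside `<b>` is not part of the output. If the full text of a claim is wanted, including nested elements, one possible variant is to wait for `end` events and join `itertext()`. The sketch below is an illustration under assumptions rather than part of the original answer: the sample XML is shortened, and skipping nested `claim-text` elements through `iterancestors()` is a design choice made here.

```python
from io import BytesIO
from lxml import etree

# Shortened sample document (an assumption, not the original data)
xml = (b"<?xml version='1.0' ?><test><claim-text><b>2</b>. A uniform comprising: "
       b"<claim-text>a. an upper garment</claim-text> "
       b"<claim-text>b. panels</claim-text></claim-text></test>")

claims = []
# 'end' events fire once an element has been parsed completely
for event, element in etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text'):
    # Skip claims nested inside another claim; the outer one will include their text
    if any(ancestor.tag == 'claim-text' for ancestor in element.iterancestors()):
        continue
    # itertext() yields the text of the element and all its descendants in document order
    claims.append(''.join(element.itertext()))

print(claims)
```

Dropping the ancestor check would instead give one entry per `claim-text` element, nested ones included.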

2011-04-21 00:52:46 jsw
