2011-04-20

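Python lxml iterparse with nested elements

I want to retrieve the content of a specific element inside an XML file. However, this element contains other XML elements, which prevent the content inside the parent tag from being extracted correctly. An example: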

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

# Fire an event each time a <claim-text> element is closed.
context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
    print(element.text)

This results in:

a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and; 
None 

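However, text such as 'A protective uniform for use ..' is missed. It seems that every 'claim-text' element that contains other inner elements is skipped. How should I change the XML parsing to get all the claims?

Thanks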

I just solved the problem with a "plain" SAX-style parser approach:

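class SimpleXMLHandler(object):
    """Parser target that collects the full text of each outermost <claim-text>."""

    def __init__(self):
        self.buffer = ''
        self.claim = 0  # nesting depth of the currently open <claim-text> elements

    def start(self, tag, attributes):
        if tag == 'claim-text':
            if self.claim == 0:
                # A new outermost claim begins: reset the buffer.
                self.buffer = ''
            self.claim += 1

    def data(self, data):
        # Collect text only while inside a <claim-text> element.
        if self.claim > 0:
            self.buffer += data

    def end(self, tag):
        if tag == 'claim-text':
            self.claim -= 1
            if self.claim == 0:
                # The outermost claim is complete: print its whole text.
                print(self.buffer)

    def close(self):
        pass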
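
The handler above has the shape of lxml's parser target interface (`start`, `data`, `end` and `close` methods). As a minimal sketch of how such a target can be plugged into a parser, the example below passes one to `etree.XMLParser(target=...)`. The class name `ClaimCollector`, its attributes and the shortened sample XML are hypothetical and not part of the original post. It tracks nesting depth so that nested claims are folded into the outermost one, and it returns the collected claims from `close()` instead of printing them.

```python
from lxml import etree


class ClaimCollector(object):
    """Hypothetical parser target: gathers the text of each outermost <claim-text>."""

    def __init__(self):
        self.claims = []   # finished claims, one string per outermost element
        self.parts = []    # text chunks of the claim currently being read
        self.depth = 0     # how many <claim-text> elements are currently open

    def start(self, tag, attrib):
        if tag == 'claim-text':
            self.depth += 1

    def data(self, data):
        # Text of nested elements (such as <b>) arrives here as well.
        if self.depth > 0:
            self.parts.append(data)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:
                self.claims.append(''.join(self.parts))
                self.parts = []

    def close(self):
        # The value returned here is handed back to the caller of the parser.
        return self.claims


# Shortened sample document (an assumption made for illustration).
xml = (b"<test><claim-text><b>2</b>. A uniform comprising: "
       b"<claim-text>a. an upper garment</claim-text></claim-text></test>")

parser = etree.XMLParser(target=ClaimCollector())
claims = etree.XML(xml, parser)
print(claims)
```

Because no tree is built when a target is used, this approach keeps memory use low for large files, at the cost of tracking the nesting by hand.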

Answer

You can use XPath to find and concatenate all text nodes that are direct children of each <claim-text> node, like this:

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

# Caveat: on 'start' events lxml does not guarantee that an element's text and
# children have been parsed yet. It appears to work here because this small
# document is read in one chunk before the first event is delivered.
context = etree.iterparse(BytesIO(xml), events=('start',), tag='claim-text')
for event, element in context:
    # text() selects only the text nodes that are direct children of the element.
    print(''.join(element.xpath('text()')))

which outputs:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: 
a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
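
Some background on why the question's code printed None: in lxml, `element.text` holds only the text before the first child element, and any text that follows a child is stored in that child's `.tail`. The outer claim starts directly with `<b>`, so its `.text` is empty. As a possible alternative to the XPath approach, here is a minimal sketch that waits for 'end' events and uses `itertext()` to join the text of each outermost claim, nested elements included. The helper name `outermost_claims` and the shortened sample XML are assumptions made for illustration, not part of the original answer.

```python
from io import BytesIO
from lxml import etree

# Shortened sample document (an assumption made for illustration).
xml = (b"<test><claim-text><b>2</b>. A uniform comprising: "
       b"<claim-text>a. an upper garment</claim-text> "
       b"<claim-text>b. a lower garment</claim-text></claim-text></test>")


def outermost_claims(source):
    """Yield the complete text of each outermost <claim-text> element."""
    # 'end' events guarantee that the element has been parsed completely.
    for event, element in etree.iterparse(source, events=('end',), tag='claim-text'):
        # Skip nested claims: their text is already part of the ancestor's text.
        if any(a.tag == 'claim-text' for a in element.iterancestors()):
            continue
        # itertext() walks .text and .tail of all descendants in document order.
        yield ''.join(element.itertext())


claims = list(outermost_claims(BytesIO(xml)))
print(claims[0])
```

Unlike `xpath('text()')`, which returns only the direct text children, this also keeps the "2" inside `<b>` and merges the sub-claims into their parent claim. Whether that is desirable depends on how the claims are meant to be split.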