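If you enable `start` events, you can keep track of ancestor nodes with a stack. If you really want to suppress all descendants of a `<feature>`, rather than just its children, you can use a simple flag, as shown in another answer. You can use `root.clear()` to blow away all completed elements. Read this.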
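To illustrate just the stack idea in isolation, here is a minimal sketch. It assumes Python 3, where `xml.etree.ElementTree` stands in for `cElementTree` and `io.StringIO` for `cStringIO`; the helper name `iter_with_ancestors` and the tiny sample document are made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET  # Python 3 stand-in for cElementTree


def iter_with_ancestors(source):
    """Yield (element, ancestor_tags) for every completed element."""
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            stack.pop()  # remove elem's own tag; what is left are its ancestors
            yield elem, tuple(stack)


# Hypothetical sample input, only for demonstration.
sample = io.StringIO("<top><feature><fchild><fgchild/></fchild></feature></top>")
paths = [(elem.tag, ancestors) for elem, ancestors in iter_with_ancestors(sample)]
for tag, ancestors in paths:
    print(tag, ancestors)
```

On each `end` event the stack holds the full chain of open ancestors, so a filter can look at the direct parent (`ancestors[-1]`) or at any ancestor (`"feature" in ancestors`).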
Code:
import xml.etree.cElementTree as et
# Produces identical answers with import lxml.etree as et
import cStringIO
def normtext(t):
    return repr("" if t is None else t.strip())

def dump(el):
    print el.tag, normtext(el.text), normtext(el.tail), el.attrib

def my_filtered_elements(source, skip_parent_tag="feature"):
    # get an iterable
    context = et.iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = context.next()
    tag_stack = [None, root.tag]
    for event, elem in context:
        # print event, elem.tag, tag_stack
        if event == "start":
            tag_stack.append(elem.tag)
        else:
            assert event == "end"
            my_tag = tag_stack.pop()
            assert my_tag == elem.tag
            parent_tag = tag_stack[-1]
            if parent_tag is not None and parent_tag != skip_parent_tag:
                dump(elem)
                # yield elem
            root.clear()

def other_filtered_elements(source, skip_parent_tag="feature"):
    in_feature_tag = False
    for event, element in et.iterparse(source, events=('start', 'end')):
        if element.tag == skip_parent_tag:
            in_feature_tag = event == 'start'
        if event == 'end' and not in_feature_tag:
            dump(element)

test_input = """
<top>
<lev1 guff="1111">
<lev2>aaaaa</lev2>
<lev2>bbbbb</lev2>
</lev1>
<feature>
feat text 1
<fchild>fcfcfcfc
<fgchild>ggggg</fgchild>
</fchild>
feat text 2
</feature>
<lev1 guff="2222">
<lev2>ccccc</lev2>c-tail
<lev2>ddddd</lev2>d-tail
<notext1></notext1>e-tail
<notext2 />f-tail
</lev1>g-tail
</top>
"""
print "=== me ==="
my_filtered_elements(cStringIO.StringIO(test_input))
print "=== other ==="
other_filtered_elements(cStringIO.StringIO(test_input))
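The output is below. You will notice from the `lev1` nodes (which still show their `guff` attribute) that `root.clear()` does not blow away elements that have not yet been completely parsed. This means that memory usage is O(depth of the tree), not O(total number of elements in the tree).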
=== me ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
fgchild 'ggggg' '' {} <<<=== do you want this?
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
=== other ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
top '' '' {} <<<=== do you want this?
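The `fgchild` line in the first listing appears because only direct children of `<feature>` are skipped, and the `top` line in the second appears because the root is not excluded. If neither is wanted, one possible variant checks the whole ancestor stack instead of just the parent. This is a rough sketch, not the code that produced the output above: it assumes Python 3 (`xml.etree.ElementTree`, `io.StringIO`, `next(context)`), and the name `elements_outside` and the shortened sample document are invented for the example.

```python
import io
import xml.etree.ElementTree as ET  # Python 3 stand-in for cElementTree


def elements_outside(source, skip_tag="feature"):
    """Yield completed elements that are neither the root nor inside skip_tag."""
    context = ET.iterparse(source, events=("start", "end"))
    event, root = next(context)  # the first event is the start of the root
    stack = [root.tag]           # tags of the currently open elements
    for event, elem in context:
        if event == "start":
            stack.append(elem.tag)
            continue
        stack.pop()              # the remaining entries are elem's ancestors
        if stack and skip_tag not in stack:
            yield elem
        root.clear()             # drop subtrees that are already complete


# Hypothetical, shortened sample input for demonstration.
sample = io.StringIO(
    "<top><lev1><lev2>a</lev2></lev1>"
    "<feature>f<fchild><fgchild>g</fgchild></fchild></feature>"
    "<lev1/></top>"
)
kept = [elem.tag for elem in elements_outside(sample)]
print(kept)
```

The `<feature>` element itself is still yielded, as in both listings above; change the test to also compare `elem.tag` with `skip_tag` if it should be dropped too.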
This can be done with [`lxml`](http://lxml.de/). It has `getparent()`. – reclosedev 2012-02-17 17:40:52
According to the benchmarks here: http://effbot.org/zone/celementtree.htm (I don't know how trustworthy they are), cElementTree is faster, though. So I would like to stick with cElementTree. – 2012-02-17 17:47:47
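@NiekdeKlein: You should also read http://lxml.de/1.3/performance.html ... "For applications that require a high parser throughput and little serialization, cET is the best choice. Also for iterparse applications that extract small amounts of data from large XML data sets. If it comes to round-trip performance, however, lxml tends to be 3-4 times faster in total. So, whenever the input documents are not considerably larger than the output, lxml is the clear winner." – 2012-02-17 20:46:32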