2012-02-17 13 views
2

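I want to loop over the elements of an XML file and yield each element unless its parent element is a feature. How can I find out an element's parent while using cElementTree's iterparse method?

So, in pseudocode: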

for event, element in cElementTree.iterparse('../test.xml'):
    if parentOf_element != 'feature':
        yield element

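How can I get the parent of the element? I know this is possible with the tree.getiterator() function, but I don't want to build the complete tree because the XML file is large.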

+0

You can do this with [`lxml`](http://lxml.de/). It has `getparent()`. – reclosedev 2012-02-17 17:40:52

+0

According to the benchmarks here: http://effbot.org/zone/celementtree.htm (I don't know how trustworthy they are), cElementTree is faster, though. So I would like to stick with cElementTree. – 2012-02-17 17:47:47

+0

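@NiekdeKlein: You should also read http://lxml.de/1.3/performance.html ... "For applications that require a high parser throughput and little serialization, cET is the best choice. The same holds for iterparse applications that extract small amounts of data from large XML data sets. If it comes to round-trip performance, however, lxml tends to be 3-4 times faster. So, as long as the input documents are not considerably larger than the output, lxml is the clear winner." – 2012-02-17 20:46:32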

Answers

1

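If you enable start events, you can keep track of the ancestor nodes with a stack. If you really want to suppress all descendants of a <feature>, not just its children, you can use a simple flag as shown in the other answer. You can use root.clear() to blow away all elements that have been completed. Read this.

Code: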

import xml.etree.cElementTree as et 
# Produces identical answers with import lxml.etree as et 
import cStringIO 

def normtext(t): 
    return repr("" if t is None else t.strip()) 

def dump(el): 
    print el.tag, normtext(el.text), normtext(el.tail), el.attrib 

def my_filtered_elements(source, skip_parent_tag="feature"): 
    # get an iterable 
    context = et.iterparse(source, events=("start", "end")) 
    # turn it into an iterator 
    context = iter(context) 
    # get the root element 
    event, root = context.next() 
    tag_stack = [None, root.tag] 
    for event, elem in context:
        # print event, elem.tag, tag_stack
        if event == "start":
            tag_stack.append(elem.tag)
        else:
            assert event == "end"
            my_tag = tag_stack.pop()
            assert my_tag == elem.tag
            parent_tag = tag_stack[-1]
            if parent_tag is not None and parent_tag != skip_parent_tag:
                dump(elem)
                # yield elem
            root.clear()

def other_filtered_elements(source, skip_parent_tag="feature"):    
    in_feature_tag = False 
    for event, element in et.iterparse(source, events=('start', 'end')):
        if element.tag == skip_parent_tag:
            in_feature_tag = event == 'start'
        if event == 'end' and not in_feature_tag:
            dump(element)

test_input = """ 
<top> 
    <lev1 guff="1111">
        <lev2>aaaaa</lev2>
        <lev2>bbbbb</lev2>
    </lev1>
    <feature>
        feat text 1
        <fchild>fcfcfcfc
            <fgchild>ggggg</fgchild>
        </fchild>
        feat text 2
    </feature>
    <lev1 guff="2222">
        <lev2>ccccc</lev2>c-tail
        <lev2>ddddd</lev2>d-tail
        <notext1></notext1>e-tail
        <notext2 />f-tail
    </lev1>g-tail
</top> 
""" 

print "=== me ===" 
my_filtered_elements(cStringIO.StringIO(test_input)) 
print "=== other ===" 
other_filtered_elements(cStringIO.StringIO(test_input)) 

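The output is shown below. You will notice from the lev1 nodes that root.clear() does not blow away elements that have not been completely parsed yet. This means that memory usage is O(depth of the tree) rather than O(total number of elements in the tree).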

=== me === 
lev2 'aaaaa' '' {} 
lev2 'bbbbb' '' {} 
lev1 '' '' {'guff': '1111'} 
fgchild 'ggggg' '' {}   <<<=== do you want this? 
feature 'feat text 1' '' {} 
lev2 'ccccc' 'c-tail' {} 
lev2 'ddddd' 'd-tail' {} 
notext1 '' 'e-tail' {} 
notext2 '' 'f-tail' {} 
lev1 '' 'g-tail' {'guff': '2222'} 
=== other === 
lev2 'aaaaa' '' {} 
lev2 'bbbbb' '' {} 
lev1 '' '' {'guff': '1111'} 
feature 'feat text 1' '' {} 
lev2 'ccccc' 'c-tail' {} 
lev2 'ddddd' 'd-tail' {} 
notext1 '' 'e-tail' {} 
notext2 '' 'f-tail' {} 
lev1 '' 'g-tail' {'guff': '2222'} 
top '' '' {}       <<<=== do you want this? 
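
For readers on Python 3, the stack technique can be sketched as a generator that reports each element's parent tag. This is a minimal sketch under assumptions that are not in the original answer: it uses `xml.etree.ElementTree` (the `cElementTree` module was removed in Python 3.9), and the names `iter_with_parent` and `SAMPLE` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # assumption: Python 3, no cElementTree

# Hypothetical sample document, much smaller than the test input above.
SAMPLE = b"""<top>
  <lev1><lev2>aaaaa</lev2></lev1>
  <feature><fchild><fgchild>ggggg</fgchild></fchild></feature>
</top>"""


def iter_with_parent(source, skip_parent_tag="feature"):
    """Yield (parent_tag, element) for every closed element whose direct
    parent is not skip_parent_tag. The root itself is never yielded."""
    stack = []  # elements that are currently open, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()  # elem has just been closed
        if not stack:
            continue  # elem was the root, which has no parent
        parent = stack[-1]
        if parent.tag != skip_parent_tag:
            yield parent.tag, elem
        # Detach finished top-level children from the root. This runs only
        # when the caller asks for the next item, so elem is intact when used.
        stack[0].clear()


pairs = [(parent, elem.tag) for parent, elem in iter_with_parent(io.BytesIO(SAMPLE))]
print(pairs)
```

With this sample, `pairs` is `[('lev1', 'lev2'), ('top', 'lev1'), ('fchild', 'fgchild'), ('top', 'feature')]`. `fchild` is dropped because its parent is `feature`, but the grandchild `fgchild` still comes through, which mirrors the `fgchild` line in the output above.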
2

You can do this with lxml. It has getparent().

Alternatively, with cElementTree, you can handle the start and end events and skip the children of feature:

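from xml.etree import cElementTree as etree

# yield is only valid inside a function, so the loop is wrapped in a
# generator (the function name is illustrative).
def elements_outside_feature(source='test.xml'):
    in_feature_tag = False
    for event, element in etree.iterparse(source, events=('start', 'end')):
        if element.tag == 'feature':
            # True between <feature> and </feature>
            in_feature_tag = event == 'start'
        if event == 'end' and not in_feature_tag:
            yield element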
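
The boolean flag above is reset by the first closing `feature` tag, so a `feature` nested inside another `feature` would let later siblings through. Nothing is released either, so the tree keeps growing as parsing proceeds. A possible variation, sketched here for Python 3 with `xml.etree.ElementTree`, counts the open `feature` elements instead and calls `root.clear()` after each closed element. The names `iter_outside_feature` and `SAMPLE` are hypothetical. Like the code above, it still yields the outermost `feature` element itself and the root.

```python
import io
import xml.etree.ElementTree as ET  # assumption: Python 3, no cElementTree

# Hypothetical sample with a feature nested inside another feature.
SAMPLE = b"""<top>
  <lev1><lev2>a</lev2></lev1>
  <feature>
    <feature><x>1</x></feature>
    <y>2</y>
  </feature>
  <lev1/>
</top>"""


def iter_outside_feature(source, skip_tag="feature"):
    """Yield every closed element that is not nested, at any depth,
    inside an element tagged skip_tag."""
    open_skips = 0  # number of skip_tag elements currently open
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first event is the root's "start"
        if event == "start":
            if elem.tag == skip_tag:
                open_skips += 1
            continue
        if elem.tag == skip_tag:
            open_skips -= 1
        if open_skips == 0:
            yield elem
        # Detach finished top-level children so they can be freed.
        root.clear()


tags = [elem.tag for elem in iter_outside_feature(io.BytesIO(SAMPLE))]
print(tags)
```

Here `tags` is `['lev2', 'lev1', 'feature', 'lev1', 'top']`: neither `x`, `y`, nor the inner `feature` is yielded. Clearing the root rather than each yielded element keeps the yielded subtrees intact for the caller.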
+0

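Thanks for the good answer, I will try it. But according to the benchmarks here: effbot.org/zone/celementtree.htm (I don't know how trustworthy they are), cElementTree is faster, though. So I would like to stick with cElementTree. – 2012-02-17 17:54:23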

+0

@NiekdeKlein, this code works with `cElementTree` (leaving out the 'c' was my typo). – reclosedev 2012-02-17 17:57:24

+0

-1 This builds the complete tree. The OP doesn't want that. – 2012-02-17 19:47:41
