2016-08-11 27 views

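Many XML documents: how to capture occurrences and maximum length of text

I have a large number of XML documents that I need to loop over, capturing how many times each child node occurs and the maximum length of its text. I can parse the XML correctly, and I can capture the count and the length for a field. What I am unsure about is which data type and approach would capture this analysis of the XML documents most efficiently.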

The output I would like to produce is (not necessarily in this format, just this data):

field1: count = 2, maxlength = 4 
field2: count = 2, maxlength = 1 

Here is an example:

import xml.etree.ElementTree as ET

# sample data:
xml = ['<data><field1>100</field1><field2>1</field2></data>', '<data><field1>1000</field1><field2>2</field2></data>']

# loop to capture fields and length
for item in xml:
    x = ET.fromstring(item)
    for child in x:
        fieldname = child.tag
        fieldlength = len(child.text)
        print(fieldname, fieldlength)

I can count the occurrences with this:

fields = {}

for item in xml:
    x = ET.fromstring(item)
    for child in x:
        if child.tag in fields:
            fields[child.tag] += 1
        else:
            fields[child.tag] = 1

How do I capture both the total number of occurrences of each field and its maximum length (if fieldlength > maxlength then fieldlength else maxlength)?

Answer


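Use a defaultdict: when a key does not exist yet, it resolves to the default value you specified (see the example below). Then all you need to do on each iteration is: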

max_length[child.tag] = max(max_length[child.tag], len(child.text)) 

Full example:

# Necessary imports
import xml.etree.ElementTree as ET
from collections import defaultdict

# Create the default dictionary. The argument is a function that generates
# the default value (returned when a key has not been set). int() returns
# zero, so we use that, but you could just as well have written
# defaultdict(lambda: 0)

max_length = defaultdict(int)

# xml is the list of XML strings from the question
for item in xml:
    x = ET.fromstring(item)
    for child in x:
        # Assign the max. This is fine even if max_length[child.tag] does not
        # exist yet, because defaultdict will resolve it to 0.
        max_length[child.tag] = max(max_length[child.tag], len(child.text))
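
Since the question asks for the count and the maximum length together, the same defaultdict idea can be extended to track both in one pass. The following is only a minimal sketch, not part of the original answer: the function name summarize_fields and the per-tag dictionary layout are hypothetical choices, and treating an empty element (where child.text is None) as length 0 is an assumption about the desired behaviour.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_fields(documents):
    """Return {tag: {"count": n, "maxlength": m}} over all XML strings.

    Hypothetical helper for illustration; the name and the result layout
    are assumptions, not something from the original answer.
    """
    # Every tag seen for the first time starts with zero count and length.
    stats = defaultdict(lambda: {"count": 0, "maxlength": 0})
    for document in documents:
        root = ET.fromstring(document)
        for child in root:
            entry = stats[child.tag]
            entry["count"] += 1
            # Assumption: an empty element (child.text is None) has length 0.
            entry["maxlength"] = max(entry["maxlength"], len(child.text or ""))
    return dict(stats)


# Sample data from the question
xml = [
    '<data><field1>100</field1><field2>1</field2></data>',
    '<data><field1>1000</field1><field2>2</field2></data>',
]

for tag, entry in summarize_fields(xml).items():
    print(f"{tag}: count = {entry['count']}, maxlength = {entry['maxlength']}")
```

With the sample data this prints the two lines requested in the question (field1 with count 2 and maxlength 4, field2 with count 2 and maxlength 1). Keeping both numbers in one dictionary per tag avoids maintaining two separate dictionaries that have to stay in sync.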