Many XML documents: how to capture occurrence counts and maximum text length

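I have a large number of XML documents that I need to loop over, capturing how many times each child node occurs and the maximum length of each field. I am able to parse the XML correctly and can capture the count and the length of each field. I am not sure which data type and approach would capture this analysis of the XML documents most efficiently.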

The output I would like to produce is (not in this exact format, just the data):

field1: count = 2, maxlength = 4 
field2: count = 2, maxlength = 1

Here is an example:

import xml.etree.ElementTree as ET 

# Sample data:
xml = ['<data><field1>100</field1><field2>1</field2></data>', '<data><field1>1000</field1><field2>2</field2></data>']

# Loop to capture field names and lengths
for item in xml:
    x = ET.fromstring(item)
    for child in x:
        fieldname = child.tag
        fieldlength = len(child.text)
        print(fieldname, fieldlength)

I can count the occurrences with this:

fields = {}

for item in xml:
    x = ET.fromstring(item)
    for child in x:
        if child.tag in fields:
            fields[child.tag] += 1
        else:
            fields[child.tag] = 1

How would I go about capturing the total number of occurrences of each field together with its maximum length (if fieldlength > maxlength then fieldlength else maxlength)?

Source

2016-08-11 mikebmassey

Use a defaultdict, which resolves a missing key to a default value that you specify (see the example below). Then on each iteration all you need to do is:

max_length[child.tag] = max(max_length[child.tag], len(child.text))

Complete example:

# Necessary imports
import xml.etree.ElementTree as ET
from collections import defaultdict

# Create the default dictionary. The argument is a function that generates
# the default value (returned when a key is not set yet). int() returns
# zero, so we use that, but you could just as well write
# defaultdict(lambda: 0).
max_length = defaultdict(int)

# `xml` is the list of XML strings from the question.
for item in xml:
    x = ET.fromstring(item)
    for child in x:
        # Assign the max. It is fine if max_length[child.tag] does not exist
        # yet, because defaultdict will resolve it to 0.
        max_length[child.tag] = max(max_length[child.tag], len(child.text))
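
Since the question asks for both the count and the maximum length per field, the two can be collected in a single pass. The following is a minimal sketch rather than part of the original answer: the `summarize` helper and the `docs` name are hypothetical, and treating an empty element (where `child.text` is `None`) as length 0 is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample data from the question.
docs = [
    '<data><field1>100</field1><field2>1</field2></data>',
    '<data><field1>1000</field1><field2>2</field2></data>',
]


def summarize(documents):
    """Return {tag: {"count": n, "maxlength": m}} across all documents."""
    # Each new tag starts with a zero count and a zero maximum length.
    stats = defaultdict(lambda: {"count": 0, "maxlength": 0})
    for doc in documents:
        root = ET.fromstring(doc)
        for child in root:
            # child.text is None for empty elements such as <field1/>;
            # counting that as length 0 is an assumption.
            length = len(child.text or "")
            entry = stats[child.tag]
            entry["count"] += 1
            entry["maxlength"] = max(entry["maxlength"], length)
    return dict(stats)


summary = summarize(docs)
for tag, entry in sorted(summary.items()):
    print(f"{tag}: count = {entry['count']}, maxlength = {entry['maxlength']}")
```

Keeping both values in one dictionary entry per tag means each document is parsed and walked only once, which may matter when the number of documents is large.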

Source

2016-08-11 22:00:19
