Parsing a large XML file with the 'xmltodict' module results in OverflowError

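I have a fairly large XML file, about 3 GB in size, that I want to parse in streaming mode using the 'xmltodict' utility. The code I have iterates over each item, builds a dictionary entry from it, and appends it to a collection held in memory, which is eventually dumped to a JSON file.

The following works perfectly on a small XML data set: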

    import xmltodict, json
    import io

    output = []

    def handle(path, item):
        # do stuff
        return True  # the callback must return True for parsing to continue

    doc_file = open("affiliate_partner_feeds.xml", "r")
    doc = doc_file.read()  # reads the entire file into memory as one string
    xmltodict.parse(doc, item_depth=2, item_callback=handle)

    f = open('jbtest.json', 'w')
    json.dump(output, f)

On a large file, I get the following:

    Traceback (most recent call last):
      File "jbparser.py", line 125, in <module>
        xmltodict.parse(doc, item_depth=2, item_callback=handle)
      File "/usr/lib/python2.7/site-packages/xmltodict.py", line 248, in parse
        parser.Parse(xml_input, True)
    OverflowError: size does not fit in an int

The exact location of the exception inside xmltodict.py is:

    def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
              namespace_separator=':', **kwargs):

        handler = _DictSAXHandler(namespace_separator=namespace_separator,
                                  **kwargs)
        if isinstance(xml_input, _unicode):
            if not encoding:
                encoding = 'utf-8'
            xml_input = xml_input.encode(encoding)
        if not process_namespaces:
            namespace_separator = None
        parser = expat.ParserCreate(
            encoding,
            namespace_separator
        )
        try:
            parser.ordered_attributes = True
        except AttributeError:
            # Jython's expat does not support ordered_attributes
            pass
        parser.StartElementHandler = handler.startElement
        parser.EndElementHandler = handler.endElement
        parser.CharacterDataHandler = handler.characters
        parser.buffer_text = True
        try:
            parser.ParseFile(xml_input)
        except (TypeError, AttributeError):
            parser.Parse(xml_input, True)  # <-- the exception is raised here
        return handler.item

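Is there any way to work around this? AFAIK, the xmlparser object is not exposed for me to play with and change 'int' to 'long'. More importantly, what is really going on here? I would really appreciate any leads on this. Thanks!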


2016-02-13 chetfaker

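Instead of reading the whole file into memory and then parsing it as a single string, pass xmltodict.parse a file object so that the file is consumed as a stream. Alternatively, use xmltodict's command-line streaming mode and deserialize the items it emits with marshal.load(file) or marshal.load(sys.stdin).

Here is an example: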

>>> def handle_artist(_, artist):
...     print artist['name']
...     return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle 
Fantômas 
King Crimson 
Chris Potter 
...
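
Applied to the code in the question, the same idea might look like the sketch below. It assumes Python 3 and an xmltodict version that accepts file-like objects (as the `ParseFile` branch in the quoted source suggests). The helper name `convert` and the in-memory sample are made up for illustration. Each item is written to the JSON output as soon as it is parsed, so neither the XML text nor the full result list has to fit in memory.

```python
import io
import json

import xmltodict


def convert(xml_stream, json_stream, item_depth=2):
    """Stream items from xml_stream and write them to json_stream as a JSON array.

    Hypothetical helper, not part of xmltodict.
    """
    count = 0

    def handle(path, item):
        nonlocal count
        # Write each item immediately instead of appending it to a list.
        json_stream.write("," if count else "[")
        json.dump(item, json_stream)
        count += 1
        return True  # a falsy return value would interrupt the parse

    # Passing a binary file-like object (not a string) lets the parser
    # pull the input in chunks rather than receive it in one call.
    xmltodict.parse(xml_stream, item_depth=item_depth, item_callback=handle)
    json_stream.write("]" if count else "[]")
    return count


# Small in-memory demonstration. For a real file, something like
# open("affiliate_partner_feeds.xml", "rb") would take the place of BytesIO.
sample = io.BytesIO(b"<feed><item><id>1</id></item><item><id>2</id></item></feed>")
out = io.StringIO()
n = convert(sample, out)
```

Opening the file in binary mode matters on Python 3, where expat's `ParseFile` expects `read()` to return bytes.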

Reading from STDIN:

import sys, marshal 
while True: 
    _, article = marshal.load(sys.stdin) 
    print article['title']
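
As for what is going on: a plausible reading of the traceback (an inference from the error message, not something confirmed in this thread) is that, because `doc` is a string, `parser.ParseFile` fails and the fallback `parser.Parse(xml_input, True)` hands all 3 GB to expat in a single call, whose length then does not fit in a C int. With a file object, `ParseFile` instead asks for the data in small pieces. The sketch below uses only the standard library (Python 3) to make those read requests visible; `CountingReader` and `count_items` are names invented for this illustration.

```python
import io
from xml.parsers import expat


class CountingReader(io.BytesIO):
    """BytesIO that records the size of every read() request it receives."""

    def __init__(self, data):
        super().__init__(data)
        self.requests = []

    def read(self, size=-1):
        self.requests.append(size)
        return super().read(size)


def count_items(stream):
    """Count <item> start tags while expat pulls the stream in chunks."""
    counts = {"item": 0}

    def start(name, attrs):
        if name == "item":
            counts["item"] += 1

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.ParseFile(stream)  # calls stream.read(n) repeatedly
    return counts["item"]


data = b"<feed>" + b"<item>x</item>" * 100000 + b"</feed>"
reader = CountingReader(data)
total = count_items(reader)

# Many small requests were made; none asked for the whole document.
print(total, len(reader.requests), max(reader.requests), len(data))
```

No single call ever sees the whole document, which is why the size limit hit by `Parse` should not come into play on this path.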


2016-02-13 10:39:34 MaxU

Thanks for the input! – chetfaker
