2014-04-12 53 views

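I'm trying to do an incremental parse of an XML document with iterparse(); the document is deliberately too big to fit in memory. I'm finding that even a no-op pass over the document exhausts process memory and makes my system start swapping. Is xml.etree.ElementTree.iterparse() not scalable for large XML documents?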

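Is it wrong to expect xml.etree.ElementTree.iterparse() to run in constant memory, independent of the size of the XML document? If so, what is the recommended package for incrementally parsing arbitrarily long XML documents? If not, what is wrong with my code?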

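Here is the code. Note that I ask for 'start' events only (so the parser doesn't try to buffer all the body elements before returning the end tag of the document's root element, <osm> in my case), and I explicitly del() the loop variables to force them to be freed.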

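Thinking that the garbage collector might not be getting a chance to run because the loop never yields, I added an explicit call to gc.collect() and time.sleep() every million iterations, but it didn't help.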

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import xml.etree.ElementTree as ET 
import pprint 

import gc 
import time 
import os 
import psutil 

def gcStats(myProc): 
    # return a human-readable summary of process memory and gc counters

    extmem = myProc.memory_info_ex() 

    a = "extmem: rss {:12n}, vms {:12n}, shared{:12n}, text{:12n}, lib {:12n}, data{:12n}, dirty{:12n}".format(
        extmem.rss, extmem.vms, extmem.shared, extmem.text, extmem.lib, extmem.data, extmem.dirty) 

    return a + "\tgc enabled {}, sumCount {:n}, lenGarbage {:n}".format(gc.isenabled(), sum(gc.get_count()), len(gc.garbage)) 

# the misbehaving code:  
def count_tags(filename): 
    retVal = {} 
    iterCount = 0 
    sleepTime = 2.0 
    myProc = psutil.Process() 

    print("Starting: gc.isenabled() == {}\n{}".format(gc.isenabled(), gcStats(myProc))) 

    for event, element in ET.iterparse(filename, ('start',)):
        assert event == 'start'
        if iterCount % 1000000 == 0:
            print('{} iterations, sleeping {} sec...'.format(iterCount, sleepTime))
            time.sleep(sleepTime)
            print('{}\nNow starting gc pass...'.format(gcStats(myProc)))
            gcr = gc.collect()
            print('gc returned {}'.format(gcr))
        iterCount += 1
        del element
        del event
    return retVal 


if __name__ == "__main__": 
    tags = count_tags('/home/bobhy/MOOC_Data/' + 'chicago.osm') 

Here is a sample of the document. It is well-formed OSM data.

<?xml version='1.0' encoding='UTF-8'?> 
<osm version="0.6" generator="Osmosis 0.43.1"> 
    <bounds minlon="-88.50500" minlat="41.33900" maxlon="-87.06600" maxlat="42.29700" origin="http://www.openstreetmap.org/api/0.6"/> 
    <node id="219850" version="54" timestamp="2011-04-06T05:17:15Z" uid="207745" user="NE2" changeset="7781188" lat="41.7585879" lon="-87.9101245"> 
    <tag k="exit_to" v="Joliet Road"/> 
    <tag k="highway" v="motorway_junction"/> 
    <tag k="ref" v="276C"/> 
    </node> 
    <node id="219851" version="47" timestamp="2011-04-06T05:18:47Z" uid="207745" user="NE2" changeset="7781188" lat="41.7593116" lon="-87.9076432"> 
    <tag k="exit_to" v="North I-294 ; Tri-State Tollway; Wisconsin"/> 
    <tag k="highway" v="motorway_junction"/> 
    <tag k="ref" v="277A"/> 
    </node> 
    <node id="219871" version="1" timestamp="2006-04-15T00:34:03Z" uid="229" user="LA2" changeset="3725" lat="41.932278" lon="-87.9179332"/> 
    <node id="700724" version="14" timestamp="2009-04-13T11:21:51Z" uid="18480" user="nickvet419" changeset="485405" lat="41.7120272" lon="-88.0158606"/> 

. . . and so on for 1.8 GB . . .

<relation id="3366425" version="1" timestamp="2013-12-07T21:37:35Z" uid="239998" user="Sundance" changeset="19330301"> 
    <member type="way" ref="250651738" role="outer"/> 
    <member type="way" ref="250651748" role="inner"/> 
    <tag k="type" v="multipolygon"/> 
    </relation> 
    <relation id="3378994" version="1" timestamp="2013-12-14T22:24:26Z" uid="371121" user="AndrewSnow" changeset="19456337"> 
    <member type="way" ref="251850076" role="outer"/> 
    <member type="way" ref="251850073" role="inner"/> 
    <member type="way" ref="251850074" role="inner"/> 
    <member type="way" ref="251850075" role="inner"/> 
    <tag k="type" v="multipolygon"/> 
    </relation> 
    <relation id="3382796" version="1" timestamp="2013-12-17T03:21:18Z" uid="567034" user="Umbugbene" changeset="19492258"> 
    <member type="way" ref="252225400" role="outer"/> 
    <member type="way" ref="252225404" role="inner"/> 
    <tag k="type" v="multipolygon"/> 
    </relation> 
</osm> 

Here is the output:

Starting: gc.isenabled() == True 
extmem: rss  9097216, vms  37199872, shared  3145728, text  3301376, lib   0, data  5820416, dirty   0 gc enabled True, sumCount 410, lenGarbage 0 
0 iterations, sleeping 2.0 sec... 
extmem: rss  9097216, vms  37335040, shared  3145728, text  3301376, lib   0, data  5955584, dirty   0 gc enabled True, sumCount 87, lenGarbage 0 
Now starting gc pass... 
gc returned 0 
1000000 iterations, sleeping 2.0 sec... 
extmem: rss 1234309120, vms 1262891008, shared  3280896, text  3301376, lib   0, data 1231511552, dirty   0 gc enabled True, sumCount 372, lenGarbage 0 
Now starting gc pass... 
gc returned 0 
2000000 iterations, sleeping 2.0 sec... 
extmem: rss 2495262720, vms 2524073984, shared  3280896, text  3301376, lib   0, data 2492694528, dirty   0 gc enabled True, sumCount 37, lenGarbage 0 
Now starting gc pass... 
gc returned 0 
3000000 iterations, sleeping 2.0 sec... 
extmem: rss 3781947392, vms 3812208640, shared  3280896, text  3301376, lib   0, data 3780829184, dirty   0 gc enabled True, sumCount 262, lenGarbage 0 
Now starting gc pass... 
gc returned 0 
4000000 iterations, sleeping 2.0 sec... 
extmem: rss 5067837440, vms 5096787968, shared  3280896, text  3301376, lib   0, data 5065408512, dirty   0 gc enabled True, sumCount 241, lenGarbage 0 
Now starting gc pass... 
gc returned 0 
5000000 iterations, sleeping 2.0 sec... 
extmem: rss 6345998336, vms 6375632896, shared  3063808, text  3301376, lib   0, data 6344253440, dirty   0 gc enabled True, sumCount 333, lenGarbage 0 
Now starting gc pass... 
gc returned 0 
6000000 iterations, sleeping 2.0 sec... 
extmem: rss 7266795520, vms 7665147904, shared  1060864, text  3301376, lib   0, data 7633768448, dirty   0 gc enabled True, sumCount 877, lenGarbage 0 
Now starting gc pass... 

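I interpret the output as showing that process virtual memory grows by about 1,000 B per iteration, i.e., per XML tag parsed (the vms column works out to roughly 1.2 to 1.3 KB per iteration). I don't think the garbage collection statistics show a monotonic increase in allocated objects, so I don't know where the memory growth is coming from. Garbage collection is definitely enabled.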
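To make the per-iteration estimate concrete, here is a small sketch that recomputes it from the vms values in the log above. The numbers are copied from the output; the variable names are only illustrative.

```python
# (iteration count, vms in bytes) pairs copied from the log output above
samples = [
    (0, 37335040),
    (1000000, 1262891008),
    (2000000, 2524073984),
    (3000000, 3812208640),
    (4000000, 5096787968),
    (5000000, 6375632896),
    (6000000, 7665147904),
]

# Growth in bytes per iteration between each pair of consecutive samples
growth_per_iteration = [
    (vms_b - vms_a) / (n_b - n_a)
    for (n_a, vms_a), (n_b, vms_b) in zip(samples, samples[1:])
]

# Average growth over the whole run
average_growth = (samples[-1][1] - samples[0][1]) / (samples[-1][0] - samples[0][0])

print([round(g) for g in growth_per_iteration], round(average_growth))
```

Every interval lands between 1,200 and 1,300 bytes per 'start' event, so the growth is steady rather than bursty.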

Answers

1

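A careful reading of the documentation for iterparse() convinced me that the above is expected behavior. The documentation says it returns a full element, with no restrictions on access to its children, so it must be keeping the (incrementally growing) document tree in memory.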
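A quick way to see this on a small input: the sketch below parses a tiny in-memory document (hypothetical stand-in data, not the real OSM file) with 'start' events only, deletes the loop variable each time, and then checks how many children the root still holds.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the OSM file (hypothetical data, not from the question)
xml_bytes = b"<osm>" + b"".join(
    b'<node id="%d"><tag k="a" v="b"/></node>' % i for i in range(1000)
) + b"</osm>"

root = None
for event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
    if root is None:
        root = element  # the first 'start' event belongs to the root element
    del element         # removes only the local name; the tree keeps its reference

retained_children = len(root)
print(retained_children)
```

All 1000 node elements are still attached to the root after the loop, which is why del on the loop variable frees nothing.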

Since my problem doesn't need access to parent or child elements, only an event for each tag encountered, I was able to solve it nicely with xml.etree.ElementTree.XMLParser().
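A minimal sketch of what that can look like, assuming a custom target object is passed to XMLParser so that no tree is built at all. The TagCounter class, the count_tags helper, and the chunk size are illustrative choices, not the code from the original post.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


class TagCounter:
    """Parser target that counts start tags and never builds a tree."""

    def __init__(self):
        self.counts = Counter()

    def start(self, tag, attrib):
        # Called once for every opening tag
        self.counts[tag] += 1

    def end(self, tag):
        pass  # nothing to do on closing tags

    def data(self, data):
        pass  # character data is ignored

    def close(self):
        # Whatever this returns is what XMLParser.close() returns
        return self.counts


def count_tags(source, chunk_size=1 << 16):
    """Feed a binary file-like object to the parser in fixed-size chunks."""
    parser = ET.XMLParser(target=TagCounter())
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()


# Small in-memory sample (hypothetical data) standing in for the real file
sample = io.BytesIO(b'<osm><node id="1"><tag k="a" v="b"/></node><node id="2"/></osm>')
tag_counts = count_tags(sample)
print(tag_counts)
```

For a real file, something like `with open(path, 'rb') as f: count_tags(f)` should work the same way. Only one chunk and the counter are held at a time, so memory use should not depend on document size.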

2

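You need to explicitly clear elements you no longer need by calling element.clear(); otherwise they linger in memory. This means you probably also want to listen for 'end' events and call clear() at the end of an enclosing element, once you know you no longer need any of its content.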
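As a rough sketch of that idea, assuming the document is a root element with many independent children (as in the OSM sample): track the nesting depth, and each time a direct child of the root ends, clear the root so the finished subtree is dropped. The function name and the depth bookkeeping are illustrative, not from the question.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_clearing(source):
    """Count tags with iterparse while discarding finished subtrees."""
    counts = Counter()
    depth = 0
    root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            counts[element.tag] += 1
            if root is None:
                root = element  # the first 'start' event is the root
            depth += 1
        else:  # 'end'
            depth -= 1
            if depth == 1:
                # A direct child of the root just closed: drop it, and
                # everything under it, from the root's child list.
                root.clear()
    return counts, root


# Small in-memory sample (hypothetical data) standing in for the real file
sample = io.BytesIO(
    b'<osm><node id="1"><tag k="a" v="b"/></node>'
    b'<relation id="2"><member ref="1"/><tag k="c" v="d"/></relation></osm>'
)
counts, root = count_tags_clearing(sample)
print(counts, len(root))
```

Note that root.clear() also wipes the root's own attributes and text, which is harmless for counting but matters if you need them later. Calling clear() only on the ended element frees its contents but leaves an empty element attached to the root, so clearing the root is what stops the list of children from growing.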