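I'm trying to do an incremental parse, via iterparse(), of an XML document that is too large to fit in memory. I'm finding that even a no-op pass over the document exhausts process memory and causes my system to start swapping. Is xml.etree.ElementTree.iterparse() not scalable to large XML documents?
Is it wrong to expect xml.etree.ElementTree.iterparse() to run in constant memory, independent of the size of the XML document? If so, what is the recommended package for incrementally parsing arbitrarily long XML documents? If not, WTF is wrong with my code?
Here's the code. Note that I ask for "start" events only (so the parser won't try to buffer all the body elements before returning the root element of the document, which in my case would happen at the <osm> end tag), and I explicitly del the loop variables to force them to be released.
Thinking that the garbage collector might not get a chance to run because the loop never yields, I added explicit calls to gc.collect() and time.sleep() every million iterations, but it didn't help.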
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import pprint
import gc
import time
import os
import psutil


def gcStats(myProc):
    # Return human-readable process memory figures plus gc stats
    # (allocation counts summed over the 3 gc generations).
    # Note: memory_info_ex() is the psutil call used when this was written;
    # newer psutil releases may require memory_info() instead.
    extmem = myProc.memory_info_ex()
    a = "extmem: rss {:12n}, vms {:12n}, shared{:12n}, text{:12n}, lib {:12n}, data{:12n}, dirty{:12n}".format(
        extmem.rss, extmem.vms, extmem.shared, extmem.text, extmem.lib, extmem.data, extmem.dirty)
    return a + "\tgc enabled {}, sumCount {:n}, lenGarbage {:n}".format(
        gc.isenabled(), sum(gc.get_count()), len(gc.garbage))


# the misbehaving code:
def count_tags(filename):
    retVal = {}  # intentionally left empty: the loop body is a no-op
    iterCount = 0
    sleepTime = 2.0
    myProc = psutil.Process()
    print("Starting: gc.isenabled() == {}\n{}".format(gc.isenabled(), gcStats(myProc)))
    for event, element in ET.iterparse(filename, ('start',)):
        assert event == 'start'
        # Every million iterations: pause, report memory, force a gc pass.
        if iterCount % 1000000 == 0:
            print('{} iterations, sleeping {} sec...'.format(iterCount, sleepTime))
            time.sleep(sleepTime)
            print('{}\nNow starting gc pass...'.format(gcStats(myProc)))
            gcr = gc.collect()
            print('gc returned {}'.format(gcr))
        iterCount += 1
        # Drop the loop variables explicitly.
        del element
        del event
    return retVal


if __name__ == "__main__":
    tags = count_tags('/home/bobhy/MOOC_Data/' + 'chicago.osm')
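As a side check on the del step in the loop above, here is a minimal sketch (not part of the original script) that parses a small in-memory document the same way and then counts how many child elements are still attached to the root afterwards. The SAMPLE document and the function name count_retained_children are hypothetical, made up for illustration. It relies on the fact that iterparse() builds an element tree as it goes, so del only removes the local name and the element stays reachable through its parent.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file: one root with 1000 empty children.
SAMPLE = b"<osm>" + b"<node id='1'/>" * 1000 + b"</osm>"


def count_retained_children(xml_bytes):
    """Iterate over 'start' events, del the loop variable each time,
    and report how many children the root still holds afterwards."""
    root = None
    for event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        if root is None:
            root = element  # the first 'start' event is the document root
        del element         # removes only the local name, not the object
    return len(root)


if __name__ == "__main__":
    print(count_retained_children(SAMPLE))
```

If the printed count equals the number of child elements in the document, the elements are being kept alive by the tree rather than by the loop variables, which would be consistent with memory growing in proportion to the number of tags parsed.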
Here's a sample of the document. It's well-formed OSM data.
<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6" generator="Osmosis 0.43.1">
<bounds minlon="-88.50500" minlat="41.33900" maxlon="-87.06600" maxlat="42.29700" origin="http://www.openstreetmap.org/api/0.6"/>
<node id="219850" version="54" timestamp="2011-04-06T05:17:15Z" uid="207745" user="NE2" changeset="7781188" lat="41.7585879" lon="-87.9101245">
<tag k="exit_to" v="Joliet Road"/>
<tag k="highway" v="motorway_junction"/>
<tag k="ref" v="276C"/>
</node>
<node id="219851" version="47" timestamp="2011-04-06T05:18:47Z" uid="207745" user="NE2" changeset="7781188" lat="41.7593116" lon="-87.9076432">
<tag k="exit_to" v="North I-294 ; Tri-State Tollway; Wisconsin"/>
<tag k="highway" v="motorway_junction"/>
<tag k="ref" v="277A"/>
</node>
<node id="219871" version="1" timestamp="2006-04-15T00:34:03Z" uid="229" user="LA2" changeset="3725" lat="41.932278" lon="-87.9179332"/>
<node id="700724" version="14" timestamp="2009-04-13T11:21:51Z" uid="18480" user="nickvet419" changeset="485405" lat="41.7120272" lon="-88.0158606"/>
. . . 1.8 GB more of the same . . .
<relation id="3366425" version="1" timestamp="2013-12-07T21:37:35Z" uid="239998" user="Sundance" changeset="19330301">
<member type="way" ref="250651738" role="outer"/>
<member type="way" ref="250651748" role="inner"/>
<tag k="type" v="multipolygon"/>
</relation>
<relation id="3378994" version="1" timestamp="2013-12-14T22:24:26Z" uid="371121" user="AndrewSnow" changeset="19456337">
<member type="way" ref="251850076" role="outer"/>
<member type="way" ref="251850073" role="inner"/>
<member type="way" ref="251850074" role="inner"/>
<member type="way" ref="251850075" role="inner"/>
<tag k="type" v="multipolygon"/>
</relation>
<relation id="3382796" version="1" timestamp="2013-12-17T03:21:18Z" uid="567034" user="Umbugbene" changeset="19492258">
<member type="way" ref="252225400" role="outer"/>
<member type="way" ref="252225404" role="inner"/>
<tag k="type" v="multipolygon"/>
</relation>
</osm>
Here's the output:
Starting: gc.isenabled() == True
extmem: rss 9097216, vms 37199872, shared 3145728, text 3301376, lib 0, data 5820416, dirty 0 gc enabled True, sumCount 410, lenGarbage 0
0 iterations, sleeping 2.0 sec...
extmem: rss 9097216, vms 37335040, shared 3145728, text 3301376, lib 0, data 5955584, dirty 0 gc enabled True, sumCount 87, lenGarbage 0
Now starting gc pass...
gc returned 0
1000000 iterations, sleeping 2.0 sec...
extmem: rss 1234309120, vms 1262891008, shared 3280896, text 3301376, lib 0, data 1231511552, dirty 0 gc enabled True, sumCount 372, lenGarbage 0
Now starting gc pass...
gc returned 0
2000000 iterations, sleeping 2.0 sec...
extmem: rss 2495262720, vms 2524073984, shared 3280896, text 3301376, lib 0, data 2492694528, dirty 0 gc enabled True, sumCount 37, lenGarbage 0
Now starting gc pass...
gc returned 0
3000000 iterations, sleeping 2.0 sec...
extmem: rss 3781947392, vms 3812208640, shared 3280896, text 3301376, lib 0, data 3780829184, dirty 0 gc enabled True, sumCount 262, lenGarbage 0
Now starting gc pass...
gc returned 0
4000000 iterations, sleeping 2.0 sec...
extmem: rss 5067837440, vms 5096787968, shared 3280896, text 3301376, lib 0, data 5065408512, dirty 0 gc enabled True, sumCount 241, lenGarbage 0
Now starting gc pass...
gc returned 0
5000000 iterations, sleeping 2.0 sec...
extmem: rss 6345998336, vms 6375632896, shared 3063808, text 3301376, lib 0, data 6344253440, dirty 0 gc enabled True, sumCount 333, lenGarbage 0
Now starting gc pass...
gc returned 0
6000000 iterations, sleeping 2.0 sec...
extmem: rss 7266795520, vms 7665147904, shared 1060864, text 3301376, lib 0, data 7633768448, dirty 0 gc enabled True, sumCount 877, lenGarbage 0
Now starting gc pass...
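I interpret the output as showing that the process's virtual memory grows by a bit over 1,000 bytes per iteration (that is, per XML tag parsed): vms rises by roughly 1.2 to 1.3 GB for every million iterations. As far as I can tell, the garbage collection statistics don't show a monotonic increase in allocated objects (sumCount fluctuates, gc.garbage stays empty, and gc.collect() returns 0), so I don't know where the memory growth is coming from. Garbage collection is definitely enabled.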
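The growth estimate can be checked with a few lines of arithmetic. This is a minimal sketch: the SAMPLES list holds the (iteration count, vms in bytes) pairs copied by hand from the log above, and the helper names are hypothetical.

```python
# (iterations, vms in bytes) pairs copied from the log output above.
SAMPLES = [
    (0, 37335040),
    (1000000, 1262891008),
    (2000000, 2524073984),
    (3000000, 3812208640),
    (4000000, 5096787968),
    (5000000, 6375632896),
    (6000000, 7665147904),
]


def bytes_per_iteration(samples):
    """Growth in bytes per iteration between each pair of consecutive samples."""
    return [
        (v1 - v0) / (n1 - n0)
        for (n0, v0), (n1, v1) in zip(samples, samples[1:])
    ]


def average_growth(samples):
    """Average growth in bytes per iteration over the whole run."""
    (n0, v0), (n1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (n1 - n0)


if __name__ == "__main__":
    for rate in bytes_per_iteration(SAMPLES):
        print("{:.0f} bytes/iteration".format(rate))
    print("average: {:.0f} bytes/iteration".format(average_growth(SAMPLES)))
```

Every interval in the log comes out between 1,200 and 1,300 bytes per iteration, so the growth looks close to linear in the number of tags parsed rather than bursty.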