2014-09-23 58 views

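Unable to pass an lxml etree object to a separate process

I am using lxml to parse multiple XML files concurrently in Python. When I start a worker process, I want my main class to do some work on the XML before handing the etree object over to it. However, when the etree object arrives in the new process, the class is still there but the XML inside the object is gone, and getroot() returns None.

I know that only picklable data can be passed through a queue, but does the same restriction apply to what I pass to the process in the "args" field?

Here is my code: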

import multiprocessing, multiprocessing.pool, time
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree))  # Returns <class 'lxml.etree._ElementTree'>
    print(id(tree))  # Returns new ID 44637320 as expected
    print(tree.getroot())  # Returns None

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(id(tree))  # Returns object ID 43998536
        print(tree.getroot())  # Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()

Any and all help is greatly appreciated.
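On the question of whether the pickling restriction also covers "args": as far as the standard library is concerned, yes. Pool.apply_async sends the task, including its arguments, to the worker through an internal queue, so the arguments are pickled just like anything put on a multiprocessing.Queue. Below is a minimal sketch of that round trip. The Handle class and the send_like_multiprocessing helper are hypothetical names made up for illustration; Handle only imitates an object whose real data is invisible to pickle, which is an assumption about why the tree arrives empty rather than a statement about lxml internals.

```python
import pickle
from multiprocessing.reduction import ForkingPickler


class Handle:
    """Hypothetical stand-in for an object whose data lives outside its Python state."""

    def __init__(self, payload):
        self.payload = payload

    def __getstate__(self):
        # Pretend the payload is held somewhere pickle cannot see,
        # so only an empty placeholder is serialized.
        return {"payload": None}

    def __setstate__(self, state):
        self.__dict__.update(state)


def send_like_multiprocessing(obj):
    """Serialize and rebuild obj the way a task argument is shipped to a worker."""
    data = ForkingPickler.dumps(obj)  # the pickler multiprocessing uses for its queues
    return pickle.loads(data)
```

Calling send_like_multiprocessing(Handle("<root/>")) gives back a Handle of the right type whose payload is None, which is the same shape of symptom as a tree whose getroot() returns None.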

UPDATE/ANSWER

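Based on the answer posted below, I have modified it a bit and managed to get it working with a much lower memory footprint, without using StringIO. The etree.tostring method returns a bytes object, which can be pickled; after unpickling, those bytes can be parsed by etree again.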

import multiprocessing, multiprocessing.pool, time, copyreg
from io import BytesIO  # needed by elementtree_unpickler
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree))  # Returns <class 'lxml.etree._ElementTree'>
    print(tree.getroot())  # Returns <Element SymCLI_ML at 0x29f5dc8>. Success!

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

def elementtree_unpickler(data):
    # Rebuild the tree in the receiving process from the serialized bytes
    return etree.parse(BytesIO(data))

def elementtree_pickler(tree):
    # Serialize the tree to bytes, which pickle can handle
    return elementtree_unpickler, (etree.tostring(tree),)

copyreg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(tree.getroot())  # Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()

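UPDATE 2

After doing some memory benchmarking, I found that passing large objects as arguments prevents them from being garbage collected in the main process. This may not matter at a small scale, but my etree objects take up several hundred MB of memory each. Once an async task has been called with the XML object as an argument, the object cannot be cleared from memory, even if it is deleted in the main process and garbage collection is invoked manually. I have therefore gone back to closing the XML in the main process and passing the file name to the child process.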
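A minimal sketch of the file-name approach described above, assuming each worker can read the file itself. The function names root_tag and root_tags are made up for illustration, and returning the root tag is only a placeholder for the real per-file work.

```python
import multiprocessing

from lxml import etree


def root_tag(path):
    """Parse the file inside the worker so no tree crosses a process boundary."""
    tree = etree.parse(path)
    # Return something small and picklable instead of the tree itself.
    return tree.getroot().tag


def root_tags(paths, processes=2):
    """Hand only file names to the pool; each worker parses its own file."""
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(root_tag, paths)
```

With this layout, root_tags(['test.xml']) would be called from under the usual if __name__ == '__main__': guard. Only short strings travel between processes, so the main process never holds a tree that is pinned by a pending task.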


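Would it be possible to put the etree object into shared memory and pass a reference to the shared memory location to the child process? – 2017-06-14 13:16:32

Answers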


Use the following code to register simple picklers/unpicklers for lxml Element and ElementTree objects. I have used it before with lxml and zmq.

import copy_reg 
try: 
    from cStringIO import StringIO 
except ImportError: 
    from StringIO import StringIO 
from lxml import etree 

def element_unpickler(data): 
    return etree.fromstring(data) 

def element_pickler(element): 
    data = etree.tostring(element) 
    return element_unpickler, (data,) 

copy_reg.pickle(etree._Element, element_pickler, element_unpickler) 

def elementtree_unpickler(data): 
    data = StringIO(data) 
    return etree.parse(data) 

def elementtree_pickler(tree): 
    data = StringIO() 
    tree.write(data) 
    return elementtree_unpickler, (data.getvalue(),) 

copy_reg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler) 

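I have added this (Python 3.4, so copy_reg is copyreg and the StringIO import is 'from io import StringIO'), but on the line that starts the process I get 'I/O error: write error'. The full code has been edited into the original question. – proudmatt 2014-09-23 13:23:18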
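The write error reported in this comment is probably caused by the buffer type: on Python 3, lxml serializes to bytes, while io.StringIO only accepts str. The following is a sketch of a Python 3 port of the answer's picklers under that assumption, using io.BytesIO instead. The clone_via_pickle helper is a made-up name, added only to exercise the registered reducers without starting a process.

```python
import copyreg
import pickle
from io import BytesIO

from lxml import etree


def element_unpickler(data):
    # Rebuild an element from its serialized bytes.
    return etree.fromstring(data)


def element_pickler(element):
    return element_unpickler, (etree.tostring(element),)


copyreg.pickle(etree._Element, element_pickler)


def elementtree_unpickler(data):
    # Rebuild a whole tree from its serialized bytes.
    return etree.parse(BytesIO(data))


def elementtree_pickler(tree):
    buffer = BytesIO()  # a bytes buffer, because tree.write() emits bytes
    tree.write(buffer)
    return elementtree_unpickler, (buffer.getvalue(),)


copyreg.pickle(etree._ElementTree, elementtree_pickler)


def clone_via_pickle(obj):
    """Round-trip obj through pickle, as happens when it is sent to a worker."""
    return pickle.loads(pickle.dumps(obj))
```

The reducers are looked up by exact type, so objects of other lxml classes (comments, for instance) would need registrations of their own.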


@Georges Marting: Very helpful, cheers. – spinus 2015-07-31 00:34:01