
I am using Python. I have 100 zip files, and each zip file contains more than 100 XML files. From those XML files I create CSV files. Python, multiprocessing: how can I optimize this code and make it faster?

import csv
import zipfile
from multiprocessing import Process
from xml.etree.ElementTree import fromstring


def parse_xml_for_csv1(data, writer1):
    root = fromstring(data)
    for node in root.iter('name'):
        # writerow() expects a sequence, so wrap the single value in a list
        writer1.writerow([node.get('value')])


def create_csv1():
    with open('output1.csv', 'w') as f1:
        writer1 = csv.writer(f1)

        # note: range(1, 100) only covers xml1.zip .. xml99.zip
        for i in range(1, 100):
            z = zipfile.ZipFile('xml' + str(i) + '.zip')
            # z.namelist() contains more than 100 xml files
            for finfo in z.namelist():
                data = z.read(finfo)
                parse_xml_for_csv1(data, writer1)


def create_csv2():
    with open('output2.csv', 'w') as f2:
        writer2 = csv.writer(f2)

        for i in range(1, 100):
            ...


if __name__ == "__main__": 
    p1 = Process(target=create_csv1) 
    p2 = Process(target=create_csv2) 
    p1.start() 
    p2.start() 
    p1.join() 
    p2.join() 

Please tell me how to optimize my code and make it run faster.


How big is each uncompressed xml file? And the CSVs you are writing? – goncalopp


goncalopp, the xml files are small (about 10 lines each). I only need 2 CSV files. – Olga


I would use lxml for the processing and keep as much of it as possible at the C level: http://lxml.de/FAQ.html#id1 –
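Following up on that comment, here is a minimal sketch (my illustration, not part of the original thread) of what the lxml swap could look like: lxml.etree.fromstring() and iter() mirror the xml.etree.ElementTree API, so parse_xml_for_csv1 from the question barely changes.

# Hypothetical lxml variant of parse_xml_for_csv1 from the question;
# the traversal API is the same, but parsing happens in C (libxml2).
from lxml.etree import fromstring

def parse_xml_for_csv1(data, writer1):
    root = fromstring(data)                # C-level parse via libxml2
    for node in root.iter('name'):         # same iter() API as ElementTree
        writer1.writerow([node.get('value')])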

Answers


You only need to define one method that takes parameters. Split the processing of the 100 .zip files across a given number of threads or processes. The more processes you add, the more CPUs you use, and it may well be faster with more than 2 processes (although disk I/O will probably become a bottleneck at some point).

In the code below, I can switch to 4 or 10 processes without copy/pasting any code. Each process handles different zip files.

Your code processes the same 100 files twice in parallel: that is slower than not using multiprocessing at all!

import csv
import zipfile
from multiprocessing import Process

# parse_xml_for_csv1 is the same helper as in the question


def create_csv(start_index, step):
    # Each process writes its own output file, numbered by chunk
    with open('output{0}.csv'.format(start_index // step), 'w') as f1:
        writer1 = csv.writer(f1)

        for i in range(start_index, start_index + step):
            z = zipfile.ZipFile('xml' + str(i) + '.zip')
            # z.namelist() contains more than 100 xml files
            for finfo in z.namelist():
                data = z.read(finfo)
                parse_xml_for_csv1(data, writer1)


if __name__ == "__main__":
    nb_files = 100
    nb_processes = 2   # raise to 4 or 8 depending on your machine

    step = nb_files // nb_processes
    lp = []
    for start_index in range(1, nb_files, step):
        p = Process(target=create_csv, args=[start_index, step])
        p.start()
        lp.append(p)
    for p in lp:
        p.join()
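As a variant of the same splitting idea (my sketch, not from the answer above), Python 3's multiprocessing.Pool can take over the process bookkeeping in the __main__ block, reusing the create_csv(start_index, step) function defined above.

from multiprocessing import Pool

# Assumes create_csv(start_index, step) is defined as in the answer above.
if __name__ == "__main__":
    nb_files = 100
    nb_processes = 2   # raise to 4 or 8 depending on your machine
    step = nb_files // nb_processes

    # One (start_index, step) pair per chunk of zip files, e.g. (1, 50), (51, 50)
    chunks = [(s, step) for s in range(1, nb_files, step)]
    with Pool(nb_processes) as pool:
        pool.starmap(create_csv, chunks)   # blocks until all chunks are done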