2011-09-11

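I am trying to use Pool from the multiprocessing module to speed up reading large csv files. To do this I adapted an example (from py2k), but it seems that the csv.DictReader object has no length. Does that mean I can only iterate over it? Is there still a way to chunk it? How can I chunk a csv (dict) reader object in Python 3.2?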

These questions seem related, but did not really answer mine: "Number of lines in csv.DictReader" and "How to chunk a list in Python 3?"

This is what my code tries to do:

import csv
import time
from multiprocessing import Pool

import networkx as nx

source = open('/scratch/data.txt', 'r')

def csv2nodes(r):
    strptime = time.strptime
    mktime = time.mktime
    l = []
    ppl = set()
    for row in r:
        cell = int(row['cell'])
        id = int(row['seq_ei'])
        st = mktime(strptime(row['dat_deb_occupation'], '%d/%m/%Y'))
        ed = mktime(strptime(row['dat_fin_occupation'], '%d/%m/%Y'))
        # collect list
        l.append([(id, cell, {1: st, 2: ed})])
        # collect separate sets
        ppl.add(id)
    return (l, ppl)


def csv2graph(source):
    r = csv.DictReader(source, delimiter=',')
    MG = nx.MultiGraph()
    l = []
    ppl = set()
    # Remember that I use integers for edge attributes, to save space! Dic above.
    # start: 1
    # end: 2
    p = Pool(processes=4)
    node_divisor = len(p._pool) * 4
    # len(r) fails here: a csv.DictReader object has no length
    node_chunks = list(chunks(r, int(len(r) / int(node_divisor))))
    num_chunks = len(node_chunks)
    pedgelists = p.map(csv2nodes,
                       zip(node_chunks))
    ll = []
    for l in pedgelists:
        ll.append(l[0])
        ppl.update(l[1])
    MG.add_edges_from(ll)
    return (MG, ppl)

Answer

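According to the csv.DictReader documentation (and that of the csv.reader object it wraps), the class returns an iterator. Your code should have thrown a TypeError when you called len() on it.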
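A minimal sketch of that behaviour, assuming a small in-memory sample (the `sample` data below is made up for illustration and stands in for the real file): calling len() on the reader fails, while a list built from it has a length.

```python
import csv
import io

# Hypothetical in-memory sample standing in for /scratch/data.txt.
sample = io.StringIO("cell,seq_ei\n1,10\n2,20\n3,30\n")
reader = csv.DictReader(sample)

# A DictReader is an iterator, so it does not support len().
try:
    len(reader)
    has_len = True
except TypeError:
    has_len = False

# Materialising the rows into a list gives something with a length,
# at the cost of holding every row in memory.
rows = list(reader)
n_rows = len(rows)
```

Once the rows are in a list, ordinary list-chunking recipes apply to them.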

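You can still chunk the data, but to split it by length the way your code does, you will have to read it completely into memory first. If you are concerned about memory, you can switch from csv.DictReader to csv.reader and skip the overhead of the dictionaries that csv.DictReader creates. To keep csv2nodes() readable, you can assign constants for the index of each field: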

CELL = 0 
SEQ_EI = 1 
DAT_DEB_OCCUPATION = 4 
DAT_FIN_OCCUPATION = 5 
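As a rough sketch of how such constants could be used with csv.reader: the `parse_row` helper, the sample row, and the two filler columns `col_a` and `col_b` are assumptions made for illustration, since the real file layout is not shown in the question.

```python
import csv
import io
import time

CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5

def parse_row(row):
    """Turn one csv.reader row (a list of strings) into an edge tuple."""
    person = int(row[SEQ_EI])
    cell = int(row[CELL])
    start = time.mktime(time.strptime(row[DAT_DEB_OCCUPATION], '%d/%m/%Y'))
    end = time.mktime(time.strptime(row[DAT_FIN_OCCUPATION], '%d/%m/%Y'))
    # Integer keys for the edge attributes, as in the question: 1 = start, 2 = end.
    return (person, cell, {1: start, 2: end})

# Hypothetical sample; col_a and col_b are placeholders for unknown columns.
sample = io.StringIO(
    "cell,seq_ei,col_a,col_b,dat_deb_occupation,dat_fin_occupation\n"
    "7,42,x,y,01/02/2010,03/02/2010\n"
)
reader = csv.reader(sample)
next(reader)  # csv.reader does not consume the header row for us
edges = [parse_row(row) for row in reader]
```

Unlike csv.DictReader, csv.reader treats the header as an ordinary row, so it has to be skipped explicitly.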

I would also recommend using a different variable name than id, since that is the name of a built-in function.
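Coming back to the chunking itself, here is one possible sketch that does not need len() at all. The `chunks` helper below is a hypothetical stand-in (the one used in the question is not shown), built on itertools.islice, and the sample data is made up.

```python
import csv
import io
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO("cell,seq_ei\n1,10\n2,20\n3,30\n4,40\n5,50\n")
reader = csv.DictReader(sample)

# Each chunk is a plain list of row dicts; only list() here pulls
# all chunks into memory at once.
node_chunks = list(chunks(reader, 2))
chunk_sizes = [len(c) for c in node_chunks]
```

Because each chunk is an ordinary list of rows, it can be handed to a worker function such as csv2nodes through Pool.map. Note that this fixes the chunk size rather than the number of chunks, since the total row count is not known without reading the whole file.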