2014-02-19 77 views
6

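Memory error using openpyxl with large Excel files

I have written a script that has to read a large number of Excel files from a folder (about 10,000). The script loads each Excel file (some of them have more than 2,000 rows) and reads one column to count the rows (checking the contents). If the row count does not equal a given number, it writes a warning to a log.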

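The problem appears once the script has read more than 1,000 Excel files. At that point it throws a MemoryError, and I don't know where the problem is. Before that, the script reads two CSV files of 14,000 rows each and stores them in lists. These lists hold the identifier of each Excel file and its corresponding row count. If that row count does not equal the row count of the Excel file, a warning is written. Could reading these lists be the problem?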

I'm using openpyxl to load the workbooks. Do I need to close them before opening the next one?

This is my code:

# -*- coding: utf-8 -*-

import os
from openpyxl import Workbook
import glob
import time
import csv
import datetime   # needed for datetime.datetime.today() in the logging blocks
from time import gmtime,strftime
from openpyxl import load_workbook

folder = '' 
conditions = 0 
a = 0 
flight_error = 0 
condition_error = 0 
typical_flight_error = 0 
SP_error = 0 


cond_numbers = [] 
with open('Conditions.csv','rb') as csv_name:   # Open the CSV file that holds the equivalences
    csv_read = csv.reader(csv_name,delimiter='\t')

    for reads in csv_read:
        cond_numbers.append(reads)

flight_TF = [] 
with open('vuelo-TF.csv','rb') as vuelo_TF: 
    csv_read = csv.reader(vuelo_TF,delimiter=';') 

    for reads in csv_read:
        flight_TF.append(reads)


excel_files = glob.glob('*.xlsx') 

for excel in excel_files: 
    print "Leyendo excel: "+excel 

    wb = load_workbook(excel) 
    ws = wb.get_sheet_by_name('Control System') 
    flight = ws.cell('A7').value 
    typical_flight = ws.cell('B7').value 
    a = 0 

    for row in range(6,ws.get_highest_row()):
        conditions = conditions + 1

        value_flight = int(ws.cell(row=row,column=0).value)
        value_TF = ws.cell(row=row,column=1).value
        value_SP = int(ws.cell(row=row,column=4).value)

        if value_flight == '':
            break

        if value_flight != flight:
            flight_error = 1    # Not all flight numbers within the flight are equal

        if value_TF != typical_flight:
            typical_flight_error = 2   # Not all typical flights within the flight are equal

        if value_SP != 100:
            SP_error = 1



    for cond in cond_numbers:
        if int(flight) == int(cond[0]):
            conds = int(cond[1])
            if conds != int(conditions):
                condition_error = 1   # The number of conditions does not match the expected one

    for vuelo_TF in flight_TF:
        if int(vuelo_TF[0]) == int(flight):
            TF = vuelo_TF[1]
            if typical_flight != TF:
                typical_flight_error = 1  # The flight does not match its typical flight

    if flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')   # append mode
        message = time+': Los flight numbers del vuelo '+str(flight)+' no coinciden.\n'
        log.write(message)
        log.close()
        flight_error = 0

    if condition_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = time+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n'
        log.write(message)
        log.close()
        condition_error = 0

    if typical_flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = time+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0

    if typical_flight_error == 2:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = time+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0

    if SP_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = time+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n'
        log.write(message)
        log.close()
        SP_error = 0

    conditions = 0 

The final if statements check the error flags and write the warnings to the log.

I'm using Windows XP with 8 GB of RAM and an Intel Xeon W3505 (two cores, 2.53 GHz).

Answers

9

The default implementation of openpyxl keeps every accessed cell in memory. I would suggest using the optimized reader instead (link: https://openpyxl.readthedocs.org/en/latest/optimized.html).

In code:

wb = load_workbook(file_path, use_iterators = True) 

Pass use_iterators = True when loading the workbook. Then access the sheet and its cells like this:

for row in sheet.iter_rows():
    for cell in row:
        cell_text = cell.value

This reduced the memory footprint to 5-10%.
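The saving comes from looking at each row once and then letting it go. Below is a minimal sketch of the question's per-row checks written against any iterable of row values, so nothing is accumulated while iterating. The function name check_rows, its parameters and the returned tuple are hypothetical; the column positions 0, 1 and 4 are taken from the question's code, and unlike the question's loop it stops counting at the first empty flight cell.

```python
def check_rows(rows, expected_flight, expected_tf):
    """Consume an iterable of row tuples once, keeping only counters and flags.

    Hypothetical helper: columns 0, 1 and 4 hold the flight number,
    the typical flight and the step percentage, as in the question.
    """
    count = 0
    flight_mismatch = False
    tf_mismatch = False
    sp_not_100 = False

    for row in rows:
        flight, typical_flight, step_pct = row[0], row[1], row[4]

        # An empty flight cell marks the end of the data block.
        if flight is None or flight == "":
            break

        count += 1
        if int(flight) != expected_flight:
            flight_mismatch = True
        if typical_flight != expected_tf:
            tf_mismatch = True
        if int(step_pct) != 100:
            sp_not_100 = True

    return count, flight_mismatch, tf_mismatch, sp_not_100
```

Because rows can be a generator, the caller decides where the rows come from, and only four small values survive each workbook.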

UPDATE: In version 2.4.0 the use_iterators = True option was removed. In newer versions, openpyxl.writer.write_only.WriteOnlyWorksheet was introduced for dumping large amounts of data.

from openpyxl import Workbook 
wb = Workbook(write_only=True) 
ws = wb.create_sheet() 

# now we'll fill it with 100 rows x 200 columns 
for irow in range(100): 
    ws.append(['%d' % i for i in range(200)]) 

# save the file 
wb.save('new_big_file.xlsx') 

I have not tested this code; it is copied from the link above.

Thanks to @SdaliM for the information.
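For reading rather than writing, openpyxl releases that no longer accept use_iterators take read_only=True in load_workbook, which also streams rows instead of building every cell in memory. A workbook opened this way keeps the file open until close() is called, which matters when looping over thousands of files. The following is a minimal sketch, assuming openpyxl 2.6 or later (for values_only); the helper name count_rows and its defaults are hypothetical, with the sheet name and starting row taken from the question.

```python
def count_rows(path, sheet_name="Control System", first_row=7):
    """Count consecutive non-empty cells in column A, starting at first_row.

    Hypothetical helper; first_row is 1-based, as in current openpyxl.
    """
    # Imported here so this helper is the only place that depends on openpyxl.
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True)  # rows are streamed, not cached
    try:
        ws = wb[sheet_name]
        count = 0
        for row in ws.iter_rows(min_row=first_row, max_col=1, values_only=True):
            value = row[0] if row else None
            if value is None or value == "":
                break
            count += 1
        return count
    finally:
        # Read-only workbooks hold the file open until closed explicitly.
        wb.close()
```

Called once per file inside the glob loop, this releases each workbook before the next one is opened.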

+1

This option no longer seems to exist (openpyxl 2.4.1). The link you provided does not mention such an option. Do you perhaps know an alternative? – SdaliM