2016-05-31 16 views
2

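Optimizing adding rows to an xls file with xlwt

I have a problem with a very large xls file. When my application adds a new statistics record (a new row at the end of the file), it takes a very long time (about a minute). If I replace the file with an empty xls file, it works fine (1-2 seconds). So I would like to optimize this as much as possible.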

I am using something like:

import xlwt
from lockfile import LockFile  # assumption: LockFile comes from the lockfile package
from xlrd import open_workbook
from xlutils.copy import copy

def add_stats_record():
    # Add record
    lock = LockFile(STATS_FILE)
    with lock:
        # Open for read
        rb = open_workbook(STATS_FILE, formatting_info=True)
        sheet_records = rb.sheet_by_index(0)

        # record_id: one more than the id stored in the last row
        START_ROW = sheet_records.nrows
        try:
            record_id = int(sheet_records.cell(START_ROW - 1, 0).value) + 1
        except Exception:
            # Empty sheet or non-numeric last cell: start numbering at 1
            record_id = 1

        # Open for write (xlutils copies the whole workbook)
        wb = copy(rb)
        sheet_records = wb.get_sheet(0)

        # Set normal style
        style_normal = xlwt.XFStyle()
        normal_font = xlwt.Font()
        style_normal.font = normal_font

        # Prepare some data here
        ........................
        # then:

        for i, col in enumerate(SHEET_RECORDS_COLS):
            sheet_records.write(START_ROW, i, possible_values.get(col[0], ''),
                                style_normal)

        wb.save(STATS_FILE)

Do you see anything here that could be improved? Or can you give me a better idea or example of how to do this?

Thanks, Nika. A solution here matters to me, because without one I will have to reimplement the feature using csv files or some other approach. – GhitaB

Can you give us more information? Roughly how large is the Excel sheet? What kind of data does it hold? –

30000-40000 rows. Simple text: strings and numbers. – GhitaB

Answers

3

Probably not the answer you want to hear, but there is hardly anything to optimize.

import xlwt, xlrd
from xlutils.copy import copy
from time import time

def add_stats_record():
    # Open for read
    start_time = time()
    rb = xlrd.open_workbook(STATS_FILE, formatting_info=True)
    sheet_records_original = rb.sheet_by_index(0)
    print('Elapsed time for opening:   %.2f' % (time()-start_time))
    # Record_id
    start_time = time()
    START_ROW = sheet_records_original.nrows
    SHEET_RECORDS_COLS = sheet_records_original.ncols
    try:
        # Read from the xlrd sheet; the writable copy does not exist yet
        record_id = int(sheet_records_original.cell(START_ROW - 1, 0).value) + 1
    except Exception:
        record_id = 1
    print('Elapsed time for record ID:   %.2f' % (time()-start_time))
    # Open for write (this times the xlutils copy of the whole workbook)
    start_time = time()
    wb = copy(rb)
    sheet_records = wb.get_sheet(0)
    print('Elapsed time for write:    %.2f' % (time()-start_time))
    # Set normal style
    style_normal = xlwt.XFStyle()
    normal_font = xlwt.Font()
    style_normal.font = normal_font

    # Read all the data and get some stats
    start_time = time()
    max_col = {}
    # Numeric columns: remember where the largest value sits
    for col_idx in range(0, 16):
        max_value = 0
        for row_idx in range(START_ROW):
            value = sheet_records_original.cell(row_idx, col_idx).value
            if value:
                val = float(value)
                if val > max_value:
                    max_value = val
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)

    # Text columns: count occurrences of the column index in each cell
    for col_idx in range(16, 31):
        max_value = 0
        for row_idx in range(START_ROW):
            value = sheet_records_original.cell(row_idx, col_idx).value
            if value:
                val = str(value).replace('text', '').count(str(col_idx))
                if val > max_value:
                    max_value = val
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)
    print('Elapsed time for reading data/stats: %.2f' % (time()-start_time))
    # Write the stats row (empty string where a column produced no stat)
    for i in range(SHEET_RECORDS_COLS):
        sheet_records.write(START_ROW, i, max_col.get(i, ''), style_normal)

    start_time = time()
    wb.save(STATS_FILE)
    print('Elapsed time for writing:   %.2f' % (time()-start_time))

if __name__ == '__main__':
    STATS_FILE = 'output.xls'
    start_time2 = time()
    add_stats_record()
    print('Total time:       %.2f' % (time() - start_time2))

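Elapsed time for opening:   2.43
Elapsed time for record ID:   0.00
Elapsed time for write:    7.62
Elapsed time for reading data/stats: 2.35
Elapsed time for writing:   3.33
Total time:       15.75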

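As these results show, there is hardly any room for improvement in your code. Opening, copying and writing account for most of the time (about 13.4 of the 15.75 seconds), and those are just plain calls into xlrd/xlwt. Passing on_demand=True to open_workbook does not help either. Using openpyxl does not improve performance: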

from openpyxl import load_workbook
from time import time

# Load workbook
start_time = time()
wb = load_workbook('output.xlsx')
print('Elapsed time for loading workbook: %.2f' % (time()-start_time))

# Read all data
start_time = time()
ws = wb.active
cell_range1 = ws['A1':'P20001']
cell_range2 = ws['Q1':'AF20001']
print('Elapsed time for reading workbook: %.2f' % (time()-start_time))

# Save to a new workbook
start_time = time()
wb.save("output_tmp.xlsx")
print('Elapsed time for saving workbook: %.2f' % (time()-start_time))

Elapsed time for loading workbook: 22.35
Elapsed time for reading workbook: 0.00
Elapsed time for saving workbook: 21.11

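Test environment: Ubuntu 14.04 (virtual machine) / Python 2.7 64-bit / regular hard disk (similar results on native Windows 10; Python 3 performs worse at loading but better at writing).

The random test data was generated with pandas and numpy:

import pandas as pd
import numpy as np

STATS_FILE = 'output.xls'
# Just random numbers
df = pd.DataFrame(np.random.rand(20000, 30), columns=range(0, 30))
# Convert half the columns to text
for i in range(15, 30):
    df[i] = 'text' + df[i].astype(str)
# Older pandas API; newer releases use writer.close() instead of writer.save()
writer = pd.ExcelWriter(STATS_FILE)
df.to_excel(writer, 'Sheet1')
writer.save()

After some fiddling with multiprocessing I found a slightly improved solution. Since the copy operation is the most time-consuming step, and sharing the workbook between processes made performance worse, I took a different approach. Both processes read the original workbook. One reads the data, computes the statistics and writes them to a file (tmp.txt); the other copies the workbook, waits for the statistics file to appear, and then writes its contents into the newly copied workbook.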

Difference: about 12% less time in total (n = 3 runs for each script). Not great, but I can't think of another way, short of not using Excel files at all.
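
Since the only remaining option mentioned is to stop using Excel files, and the asker brings up csv in the comments, here is a minimal Python 3 sketch of that fallback. It is not part of the original answer: the file name `stats.csv`, the column names and the helper `last_record_id` are hypothetical, and the file locking from the question is left out for brevity. The idea is that appending only touches the end of the file, so the cost should not grow with the 30000-40000 existing rows.

```python
import csv
import os

STATS_CSV = "stats.csv"                      # hypothetical file name
COLUMNS = ["record_id", "name", "value"]     # hypothetical column names


def last_record_id(path, tail_bytes=4096):
    """Return the id stored in the last row, or 0 if there is none.

    Only the last `tail_bytes` of the file are read. This assumes one row
    is shorter than `tail_bytes` and that cells contain no line breaks.
    """
    if not os.path.exists(path):
        return 0
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        tail = f.read().decode("utf-8", errors="ignore")
    lines = [line for line in tail.splitlines() if line.strip()]
    if not lines:
        return 0
    try:
        return int(next(csv.reader([lines[-1]]))[0])
    except (ValueError, IndexError):
        # The last line is the header or not a numeric id
        return 0


def add_stats_record(values, path=STATS_CSV):
    """Append one record and return its id."""
    record_id = last_record_id(path) + 1
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow([record_id] + [values.get(c, "") for c in COLUMNS[1:]])
    return record_id
```

A call such as `add_stats_record({"name": "visits", "value": 3})` appends a single line. If an .xls file is still needed for reporting, it could be generated from the csv on demand rather than on every insert.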

xls_copy.py

from os import stat
from time import sleep, time

from xlrd import open_workbook
from xlutils.copy import copy

def xls_copy(STATS_FILE, START_ROW, style_normal):
    print('started 2nd thread')
    start_time = time()
    rb = open_workbook(STATS_FILE, formatting_info=True)
    wb = copy(rb)
    sheet_records = wb.get_sheet(0)
    print('2: Elapsed time for xls_copy:   %.2f' % (time()-start_time))

    # Poll until the other process has written the statistics file
    counter = 0
    filesize = stat('tmp.txt').st_size
    while filesize == 0 and counter < 10**5:
        sleep(0.01)
        filesize = stat('tmp.txt').st_size
        counter += 1

    # Each line is "<column>;<row>;<column>"; split off the column index only,
    # so the cell receives the same "row;col" text as in the single-process version
    with open('tmp.txt', 'r') as f:
        for line in f:
            cells = line.rstrip('\n').split(';', 1)
            sheet_records.write(START_ROW, int(cells[0]), cells[1], style_normal)

    start_time = time()
    wb.save('tmp_' + STATS_FILE)
    print('2: Elapsed time for writing:   %.2f' % (time()-start_time))

xlsx_multi.py

from xls_copy import xls_copy
import xlwt, xlrd
from time import time
from multiprocessing import Process

def add_stats_record():
    # Open for read
    start_time = time()
    rb = xlrd.open_workbook(STATS_FILE, formatting_info=True)
    sheet_records_original = rb.sheet_by_index(0)
    print('Elapsed time for opening:   %.2f' % (time()-start_time))
    # Record_id
    start_time = time()
    START_ROW = sheet_records_original.nrows
    # Create an empty statistics file for the second process to poll
    f = open('tmp.txt', 'w')
    f.close()
    # Set normal style
    style_normal = xlwt.XFStyle()
    normal_font = xlwt.Font()
    style_normal.font = normal_font

    # Start the second process, which copies the workbook in parallel
    p = Process(target=xls_copy, args=(STATS_FILE, START_ROW, style_normal,))
    p.start()
    print('continuing with 1st thread')
    SHEET_RECORDS_COLS = sheet_records_original.ncols
    try:
        record_id = int(sheet_records_original.cell(START_ROW - 1, 0).value) + 1
    except Exception:
        record_id = 1
    print('Elapsed time for record ID:   %.2f' % (time()-start_time))

    # Read all the data and get some stats
    start_time = time()
    max_col = {}
    # Numeric columns: remember where the largest value sits
    for col_idx in range(0, 16):
        max_value = 0
        for row_idx in range(START_ROW):
            value = sheet_records_original.cell(row_idx, col_idx).value
            if value:
                val = float(value)
                if val > max_value:
                    max_value = val
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)

    # Text columns: count occurrences of the column index in each cell
    for col_idx in range(16, 31):
        max_value = 0
        for row_idx in range(START_ROW):
            value = sheet_records_original.cell(row_idx, col_idx).value
            if value:
                val = str(value).replace('text', '').count(str(col_idx))
                if val > max_value:
                    max_value = val
                    max_col[col_idx] = str(row_idx) + ';' + str(col_idx)
    # Write statistics to a temp file, one "<column>;<row>;<column>" line each
    with open('tmp.txt', 'w') as f:
        for k in max_col:
            f.write(str(k) + ';' + max_col[k] + '\n')
    print('Elapsed time for reading data/stats: %.2f' % (time()-start_time))
    p.join()

if __name__ == '__main__':
    STATS_FILE = 'output.xls'
    start_time2 = time()
    add_stats_record()
    print('Total time:       %.2f' % (time() - start_time2))
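
The two scripts hand the statistics over through tmp.txt, which the second process polls. A possible alternative, which the answer did not use, is a `multiprocessing.Queue`: the worker blocks on `get()` until the statistics arrive, so no polling loop or temp file is needed. The sketch below only shows the hand-off. The function names are hypothetical, the workbook copy is replaced by a short `sleep` as a stand-in, and whether it changes the measured 12% is untested.

```python
import time
from multiprocessing import Process, Queue


def compute_stats(rows):
    """Return {column index: index of the row holding the column maximum}."""
    stats = {}
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        stats[col] = column.index(max(column))
    return stats


def copy_then_write(stats_queue, result_queue):
    """Worker: do the slow copy first, then wait for the statistics."""
    # Stand-in for the slow xlutils copy(rb) call (assumption, not real I/O)
    time.sleep(0.1)
    # Blocks until the parent process puts the statistics on the queue
    stats = stats_queue.get()
    # A real worker would call sheet.write(...) and wb.save(...) here
    result_queue.put(sorted(stats.items()))


def run(rows):
    """Compute the statistics while the worker performs its copy."""
    stats_queue, result_queue = Queue(), Queue()
    worker = Process(target=copy_then_write, args=(stats_queue, result_queue))
    worker.start()
    stats_queue.put(compute_stats(rows))
    written = result_queue.get()
    worker.join()
    return written
```

As in xlsx_multi.py, `run(...)` should be called from under an `if __name__ == '__main__':` guard so that the module can be imported safely by the child process.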