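How can I open an Excel file quickly in Python?

I am currently using PyExcelerator to read Excel files, but it is extremely slow. I regularly need to open Excel files larger than 100MB, and loading a single file takes more than 20 minutes. How can I open an Excel file quickly in Python?

The features I need are:

Open an Excel file, select specific sheets, and load them into a dict or list object.
Sometimes: select a specific column and load only the entire rows that have a specific value in that column.
Read password-protected Excel files.

The code I am using now is: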

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
result_list = []
number_of_rows = 500000
for i in range(0, number_of_rows):
    ok = False
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i, h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t", "").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:
        break

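Any suggestions?

Source

2011-05-03 Felix Yan

Have you tried other libraries? (I have no technical knowledge of this topic, I'm just curious) – Trufa 2011-05-03 04:20:20

Yes, I have, but those never have the ability to write xls. After reading the large xls files, I have to do some calculations and save the results into a small xls. – 2011-05-03 04:24:27

@FelixYan: OK, good to know. I hope you get some good answers! – Trufa 2011-05-03 04:30:34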

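pyExcelerator appears to be unmaintained. To write xls files, use xlwt, which is a fork of pyExcelerator with bug fixes and many enhancements. The (very basic) xls reading capability of pyExcelerator was removed from xlwt. To read xls files, use xlrd.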
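As a rough sketch of what reading with xlrd might look like for the features listed in the question: the helper names `open_sheet`, `load_rows` and `rows_where` are made up for illustration, while `open_workbook`, `sheet_by_index`, `nrows`, `row_values` and `cell_value` come from xlrd. It is written for Python 3 and has not been benchmarked against the 100MB files mentioned above.

```python
def open_sheet(filepath, index=0):
    """Open an .xls workbook with xlrd and return one of its sheets."""
    import xlrd  # third-party; imported here so the helpers below run without it
    return xlrd.open_workbook(filepath).sheet_by_index(index)


def load_rows(sheet):
    """Return every row of the sheet as a list of cell values."""
    # sheet.nrows gives the real row count, so no guessed maximum is needed.
    return [sheet.row_values(rowx) for rowx in range(sheet.nrows)]


def rows_where(sheet, colx, wanted):
    """Return only the rows whose cell in column colx equals wanted."""
    return [sheet.row_values(rowx)
            for rowx in range(sheet.nrows)
            if sheet.cell_value(rowx, colx) == wanted]
```

With a real file this could be called as `rows = load_rows(open_sheet(filepath))`. The helpers only rely on `nrows`, `row_values` and `cell_value`, so they can be tried on any object that provides those three.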

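If it takes 20 minutes to load a 100MB xls file, you must be using one or more of: a slow computer, a computer with very little available memory, or an older version of Python.

Neither pyExcelerator nor xlrd reads password-protected files.

Here's a link that covers xlrd and xlwt.

Disclaimer: I'm the author of xlrd and the maintainer of xlwt.

Source

2011-05-03 04:34:09

Thanks, I'll try those two. In fact, I'm using an AMD Phenom II X4 945 with 4G RAM, of which 2G or more is free, on x86_64 Linux, with an SSD and Python 2.7. The reading process may be even slower elsewhere. – 2011-05-03 04:48:20

xlrd is pretty good for reading files and xlwt is pretty good for writing them. In my experience, both are superior to pyExcelerator.

Source

2011-05-03 04:19:21 spulec

You could try preallocating the list to its full size in a single statement, instead of appending one item at a time, like this (one large memory allocation should be faster than many small ones):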

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
# [] * number_of_rows is still an empty list, so create one empty row per index
result_list = [[] for x in range(number_of_rows)]
for i in range(0, number_of_rows):
    ok = False
    #result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i, h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t", "").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:
        break

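If doing it this way turns out to give a substantial performance boost, you could also try preallocating each list item with the number of columns, then assigning values by index rather than appending one value at a time. Here is a snippet that creates a 10x10, two-dimensional list in a single statement, with an initial value of 0:

L = [[0] * 10 for i in range(10)]

So, folded into your code, it might work something like this: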

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
# one inner list per row, each holding one slot per column
result_list = [[''] * number_of_columns for x in range(number_of_rows)]
for i in range(0, number_of_rows):
    ok = False
    #result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i, h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t", "").strip()
        # nested lists are indexed as [i][h], not [i, h]
        result_list[i][h] = item
        if item != '':
            ok = True
    if not ok:
        break
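Whether preallocating actually beats appending is best settled by measuring on a small subset. A minimal sketch, assuming Python 3 and using a made-up `cell` function in place of the parsed sheet (the 1000-row size and the function names are illustrative only):

```python
import timeit

NUM_ROWS = 1000   # small subset standing in for the real sheet
NUM_COLS = 44


def cell(i, h):
    """Hypothetical stand-in for parsed_dictionary[i, h]."""
    return "r%dc%d" % (i, h)


def fill_by_append():
    """Grow the result one row and one cell at a time."""
    result = []
    for i in range(NUM_ROWS):
        result.append([])
        for h in range(NUM_COLS):
            result[i].append(cell(i, h))
    return result


def fill_preallocated():
    """Allocate every row and column slot first, then assign by index."""
    result = [[''] * NUM_COLS for _ in range(NUM_ROWS)]
    for i in range(NUM_ROWS):
        for h in range(NUM_COLS):
            result[i][h] = cell(i, h)
    return result


if __name__ == "__main__":
    for func in (fill_by_append, fill_preallocated):
        seconds = timeit.timeit(func, number=20)
        print("%s: %.3f s for 20 runs" % (func.__name__, seconds))
```

Both functions build the same list, so any difference in the printed times comes from the allocation strategy alone. The numbers will vary by machine and Python version.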

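Source

2011-05-03 04:33:01

The problem is that I don't know the size of the xls file, so the 'number_of_rows' variable is just my guess at the maximum size. So... would preallocating use too much memory? – 2011-05-03 04:40:28

You know the number of columns but not the number of rows? Is the number of columns fixed? Either way, it is probably worth a try. Do a performance comparison of the two algorithms on a subset, say 1000 rows. You can measure from there. – 2011-05-03 04:43:28

Thanks. Certainly worth a try :P. You're right, the number of columns is fixed and I don't know the number of rows. – 2011-05-03 04:44:17

Unrelated to your question: if you are trying to check that none of the columns is an empty string, set ok = True at the start and do this in the inner loop instead (ok = ok and item != ''). Also, you can use isinstance(item, basestring) to test whether a variable is a string.

Revised version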

for i in range(0, number_of_rows):
    ok = True
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i, h]
        if isinstance(item, basestring):
            item = item.replace("\t", "").strip()
        result_list[i].append(item)
        ok = ok and item != ''

    if not ok:
        break
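The flag can also be computed with the built-ins `any` and `all`, which may read more clearly. The two are not interchangeable: `any` reproduces the test in the question's loop (stop at the first row with no data at all), while `all` reproduces the test in the revised loop above (stop at the first row containing an empty cell). A sketch for Python 3, where `str` takes the place of `basestring`; `clean_row`, `take_rows`, `has_data` and `is_complete` are illustrative names, and unlike the loops above the row that fails the test is not kept:

```python
def clean_row(row):
    """Strip tabs and surrounding whitespace from the text cells of a row."""
    return [c.replace("\t", "").strip() if isinstance(c, str) else c
            for c in row]


def take_rows(rows, keep_going):
    """Collect cleaned rows until keep_going(row) is false for one of them."""
    result = []
    for row in rows:
        row = clean_row(row)
        if not keep_going(row):
            break
        result.append(row)
    return result


def has_data(row):
    # True when at least one cell is non-empty (the question's test).
    return any(c != '' for c in row)


def is_complete(row):
    # True when no cell is empty (the revised loop's test).
    return all(c != '' for c in row)
```

Passing `has_data` or `is_complete` as `keep_going` makes the stopping rule explicit at the call site.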

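Source

2011-05-03 04:41:40 Imran

Thanks! I have been uncomfortable with the 'type(item) is StringType or type(item) is UnicodeType' stuff for so long! But I don't think 'ok = ok and item != ''' is easy to read afterwards; it's just a little hacky :) – 2011-05-03 05:03:27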
