Python memory leak when filling a list - how to fix it?

I have a piece of code that looks like this:

from collections import defaultdict

downloadsByExtensionCount = defaultdict(int)
downloadsByExtensionList = []
logFiles = ['file1.log', 'file2.log', 'file3.log', 'file4.log']


for logFile in logFiles:
    log = open(logFile, 'r', encoding='utf-8')
    logLines = log.readlines()

    for logLine in logLines:
        date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent = logLine.split(" ")

        downloadsByExtensionCount[cs_uri_stem] += 1
        downloadsByExtensionList.append([date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent])

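The four files are around 150MB each, and each of them has around 60 000 - 800 000 lines in it.

I started building the script with just one of these files, because testing the functionality was faster that way. Now that all the logic and functionality is in place, I of course tried running it on all four log files at once. This is what I get when the script starts reading data from the fourth file: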

Traceback (most recent call last): 
    File "C:\Python32\lib\codecs.py", line 300, in decode 
    (result, consumed) = self._buffer_decode(data, self.errors, final) 
MemoryError

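So I looked at how much memory this thing consumes, and this is what I found:

The script reads the first three files and reaches somewhere around 1800-1950MB. Then it starts reading the last file, grows by another 50-100MB, and fails with the error. I also tried running the script with the last line (the append) commented out, and then it peaked at around 500MB in total.

So, what am I doing wrong? The four files add up to around 600MB, yet the script consumes around 1500MB just to fill the list from three of the four files.

I just don't understand why. How can I improve this? Thanks.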


2011-07-13 pootzko

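You can use the built-in sqlite3 module for the data manipulation. You can also pass the special name ":memory:" instead of "c:/temp/example" to create the database in RAM. If the database is not kept in RAM, the limit is the free space on your hard drive.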

import sqlite3
from collections import defaultdict

downloadsByExtensionCount = defaultdict(int)
# downloadsByExtensionList = []
logFiles = ['file1.log', 'file2.log', 'file3.log', 'file4.log']


conn = sqlite3.connect('c:/temp/example')
c = conn.cursor()
# Create table
c.execute('create table if not exists logs(date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent)')

for logFile in logFiles:
    try:
        log = open(logFile, 'r', encoding='utf-8')
    except IOError:
        # skip log files that cannot be opened
        continue

    # iterate over the file line by line; "with" closes it afterwards
    with log:
        for logLine in log:
            date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent = logLine.split(" ")

            downloadsByExtensionCount[cs_uri_stem] += 1
            c.execute(
                'insert into logs(date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent) values(?,?,?,?,?,?,?)',
                (date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent)
            )

conn.commit()
conn.close()
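
Once the rows are in SQLite, the per-file counts can also be computed by the database itself rather than in a Python dict. The following is only a minimal sketch, not the answerer's code: it assumes the same seven space-separated fields, uses the special name ":memory:" so it runs without touching the disk, and feeds made-up sample lines (`sample_lines` is hypothetical) through `executemany`.

```python
import sqlite3

# Hypothetical sample data standing in for the real log files.
sample_lines = [
    "2011-07-13 09:00:01 10.0.0.1 GET 80 /files/a.zip Mozilla/5.0\n",
    "2011-07-13 09:00:02 10.0.0.2 GET 80 /files/b.pdf Mozilla/5.0\n",
    "2011-07-13 09:00:03 10.0.0.3 GET 80 /files/a.zip Opera/9.80\n",
]

conn = sqlite3.connect(":memory:")  # swap in a file path to keep the data on disk
conn.execute(
    "create table logs(date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent)"
)

# executemany pulls rows from the generator one at a time,
# so no intermediate list of rows is built in Python.
conn.executemany(
    "insert into logs values (?,?,?,?,?,?,?)",
    (tuple(line.rstrip("\n").split(" ")) for line in sample_lines),
)
conn.commit()

# Let the database do the counting.
counts = dict(
    conn.execute("select cs_uri_stem, count(*) from logs group by cs_uri_stem")
)
conn.close()
print(counts)
```

With a file path instead of ":memory:", the same query works on data that never has to fit in RAM all at once.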


2011-07-13 10:06:35 Alex

Iterate directly over the file contents:

for logFile in logFiles:
    log = open(logFile, 'r', encoding='utf-8')
    for logLine in log:
        ...
    log.close()

Use a tuple instead of a list:

>>> import sys
>>> sys.getsizeof(('1','2','3'))
80
>>> sys.getsizeof(['1','2','3'])
96
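
The container is only part of the cost. Every field is a separate str object with its own overhead, which goes a long way toward explaining why the list in memory ends up much larger than the files on disk. Below is a rough sketch for measuring this on your own interpreter; `row_footprint` is a hypothetical helper, the sample line is made up, and the exact byte counts depend on the Python version and platform.

```python
import sys

def row_footprint(row):
    """Approximate bytes used by a row: the container plus each field."""
    return sys.getsizeof(row) + sum(sys.getsizeof(field) for field in row)

# Made-up line with the seven fields used in the question.
line = "2011-07-13 09:00:01 10.0.0.1 GET 80 /files/a.zip Mozilla/5.0"
fields = line.split(" ")

as_list = row_footprint(fields)
as_tuple = row_footprint(tuple(fields))

print("raw line:", len(line), "characters")
print("as list: ", as_list, "bytes")
print("as tuple:", as_tuple, "bytes")
```

Multiplying the per-row figure by the number of lines gives a ballpark for the whole list.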


2011-07-13 09:49:50

log.readlines() reads the file's contents into a list of lines. You can iterate over the file directly to avoid that extra list.

from collections import defaultdict

downloadsByExtensionCount = defaultdict(int)
downloadsByExtensionList = []
logFiles = ['file1.log', 'file2.log', 'file3.log', 'file4.log']


for logFile in logFiles:
    # closes the file after the block
    with open(logFile, 'r', encoding='utf-8') as log:
        # just iterate over the file
        for logLine in log:
            date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent = logLine.split(" ")
            downloadsByExtensionCount[cs_uri_stem] += 1
            # tuples are enough to store the data
            downloadsByExtensionList.append((date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent))
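
If only the counts are needed, the rows do not have to be kept at all. As a sketch of that idea (it is not part of the answer above), a generator can yield one parsed row at a time. `iter_log_rows` is a hypothetical name, and the demo writes a small temporary file with made-up lines so the example is self-contained.

```python
import os
import tempfile
from collections import Counter

def iter_log_rows(paths):
    """Yield one tuple of fields per log line, reading files lazily."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as log:
            for line in log:
                yield tuple(line.rstrip("\n").split(" "))

# Demo on a temporary file with made-up lines.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("2011-07-13 09:00:01 10.0.0.1 GET 80 /files/a.zip Mozilla/5.0\n")
    tmp.write("2011-07-13 09:00:02 10.0.0.2 GET 80 /files/b.pdf Mozilla/5.0\n")
    tmp.write("2011-07-13 09:00:03 10.0.0.3 GET 80 /files/a.zip Opera/9.80\n")

# Field 5 is cs_uri_stem; only the counter grows, not a list of rows.
downloads = Counter(row[5] for row in iter_log_rows([tmp.name]))
os.remove(tmp.name)
print(downloads.most_common())
```

Memory use then depends on the number of distinct cs_uri_stem values rather than on the number of lines.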


2011-07-13 09:54:03

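Jochen is right - this is the correct way to iterate over a file. One thing I would add: if you have to go through many files, you should also use the 'with open(filename) as f:' approach, which guarantees that the file handle is closed as soon as you are done with it, instead of leaving that to the garbage collector. – synthesizerpatel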

When iterating over a file with the "with" approach, is fileName.close() still needed? I searched a bit and the answer seems to be no? – pootzko
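
A quick way to check this yourself: the file object's `closed` attribute is True once the `with` block has exited, so no explicit close() call is needed. The snippet below is a small sketch that uses a temporary file with made-up content.

```python
import os
import tempfile

# Create a throwaway file to read from.
with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")

with open(tmp.name, "r", encoding="utf-8") as log:
    lines_seen = sum(1 for _ in log)
    closed_inside = log.closed   # still open inside the block

closed_after = log.closed        # closed automatically on exit

os.remove(tmp.name)
print(lines_seen, closed_inside, closed_after)
```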

Using tuples and the "with" approach, the script runs much faster and takes around 200-300MB less, but I still get the MemoryError.. – pootzko
