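I am writing a custom file system crawler that is passed millions of globs to process through sys.stdin. When I run the script, its memory usage grows massively over time and the whole thing practically grinds to a halt. I have written a minimal case, shown below, that reproduces the problem. Am I doing something wrong, or have I found a bug in Python or the glob module? (I am using Python 2.5.2.) Why am I leaking memory with this Python loop?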
#!/usr/bin/env python
import glob
import sys
import gc

previous_num_objects = 0

for count, line in enumerate(sys.stdin):
    # Expand one glob pattern per input line.
    glob_result = glob.glob(line.rstrip('\n'))

    # Count the objects the garbage collector is currently tracking.
    current_num_objects = len(gc.get_objects())
    new_objects = current_num_objects - previous_num_objects

    print "(%d) This: %d, New: %d, Python Garbage: %d, Python Collection Counts: %s"\
        % (count, current_num_objects, new_objects, len(gc.garbage), gc.get_count())
    previous_num_objects = current_num_objects
The output looks like:
(0) This: 4042, New: 4042, Python Garbage: 0, Python Collection Counts: (660, 5, 0)
(1) This: 4061, New: 19, Python Garbage: 0, Python Collection Counts: (90, 6, 0)
(2) This: 4064, New: 3, Python Garbage: 0, Python Collection Counts: (127, 6, 0)
(3) This: 4067, New: 3, Python Garbage: 0, Python Collection Counts: (130, 6, 0)
(4) This: 4070, New: 3, Python Garbage: 0, Python Collection Counts: (133, 6, 0)
(5) This: 4073, New: 3, Python Garbage: 0, Python Collection Counts: (136, 6, 0)
(6) This: 4076, New: 3, Python Garbage: 0, Python Collection Counts: (139, 6, 0)
(7) This: 4079, New: 3, Python Garbage: 0, Python Collection Counts: (142, 6, 0)
(8) This: 4082, New: 3, Python Garbage: 0, Python Collection Counts: (145, 6, 0)
(9) This: 4085, New: 3, Python Garbage: 0, Python Collection Counts: (148, 6, 0)
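On every 100th iteration, 100 objects are freed, so len(gc.get_objects()) increases by 200 every 100 iterations. len(gc.garbage) never changes from 0. The generation 2 collection count increases slowly, while the generation 0 and generation 1 counts go up and down.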
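To see which kinds of objects account for the roughly 3 new objects per iteration, one option is to tally gc.get_objects() by type and compare two snapshots. The following is only a rough sketch: the helper names count_by_type and growth are hypothetical, and the list of empty lists is just a stand-in for the code being investigated (in the loop above it would be the glob.glob call). It uses nothing beyond gc and collections.defaultdict.

```python
import gc
from collections import defaultdict


def count_by_type():
    """Map type name -> number of objects currently tracked by the collector."""
    counts = defaultdict(int)
    for obj in gc.get_objects():
        counts[type(obj).__name__] += 1
    return counts


def growth(before, after):
    """Return {type name: increase} for every type whose count went up."""
    grown = {}
    for name, n in after.items():
        delta = n - before.get(name, 0)
        if delta > 0:
            grown[name] = delta
    return grown


if __name__ == "__main__":
    before = count_by_type()
    # Stand-in workload (an assumption for illustration only).
    kept = [[] for _ in range(1000)]
    after = count_by_type()

    # Print the types that grew, largest increase first.
    for name, delta in sorted(growth(before, after).items(),
                              key=lambda item: -item[1]):
        print("%-20s +%d" % (name, delta))
```

Taking the snapshots a few hundred iterations apart rather than on every line should make a steadily growing type stand out from the normal churn.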
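This does accumulate a lot of uncollected objects. But this script doesn't actually grind to a halt, does it? Can you make a similar small script that really does crawl to a halt? – 2010-02-02 12:51:24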