Memory allocation failed : growing buffer - Python

I am working on a script that scrapes thousands of different web pages. Since these pages are usually different (they come from different sites), I use multithreading to speed up scraping.
EDIT: short summary
-------
I load 300 URLs (HTMLs) at a time in a pool of 300 workers. Since the size of the HTML is variable, the sum of the sizes is sometimes too large and Python raises: internal buffer error : Memory allocation failed : growing buffer. I would like to somehow check whether this is about to happen and, if so, make the workers wait until the buffer is no longer full.
-------
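One way to implement the "check and wait until the buffer has room" idea is a shared byte budget guarded by a condition variable: each worker reserves the size of its HTML before parsing and releases it afterwards, blocking when the budget is exhausted. This is a minimal sketch; the MemoryBudget class and the 200 MB cap are my assumptions, not part of the original script:

import threading

class MemoryBudget(object):
    """Blocks callers once the total reserved size exceeds a cap."""
    def __init__(self, max_bytes=200 * 1024 * 1024):  # assumed 200 MB cap
        self._cond = threading.Condition()
        self._max = max_bytes
        self._used = 0

    def acquire(self, nbytes):
        # Note: a single page larger than the cap would block forever
        # and needs special handling (e.g. skip or truncate it).
        with self._cond:
            while self._used + nbytes > self._max:
                self._cond.wait()
            self._used += nbytes

    def release(self, nbytes):
        with self._cond:
            self._used -= nbytes
            self._cond.notify_all()

budget = MemoryBudget()

A worker would call budget.acquire(len(html)) before handing a page to the parser and budget.release(len(html)) once the tree is built; a concrete hook point is sketched after _load_root below.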
This multithreaded approach works, but sometimes Python starts throwing:
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
to the console. I suppose that is because the HTML I keep in memory can total 300 * (e.g. 1 MB) = 300 MB.
EDIT:
I know I can reduce the number of workers, and I will. But that is not a solution; it only lowers the chance of hitting the error. I would like to avoid this error altogether...
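A complementary way to make the error impossible rather than merely less likely is to refuse to buffer oversized pages in the first place: stream the response and abort once it grows past a cap. A sketch; fetch_capped and the 5 MB limit are hypothetical, not taken from the original script:

import requests

MAX_HTML_SIZE = 5 * 1024 * 1024  # assumed: pages above 5 MB are skipped

def fetch_capped(url, timeout=(10, 10)):
    """Download a page, but give up as soon as it exceeds MAX_HTML_SIZE."""
    r = requests.get(url, timeout=timeout, stream=True, verify=False)
    r.raise_for_status()
    chunks, total = [], 0
    for chunk in r.iter_content(chunk_size=64 * 1024):
        total += len(chunk)
        if total > MAX_HTML_SIZE:
            r.close()  # drop the connection without reading the rest
            raise ValueError('page exceeds {} bytes'.format(MAX_HTML_SIZE))
        chunks.append(chunk)
    return b''.join(chunks)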
I started logging the HTML sizes:
ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
and the results were (in part):
2017-03-05 13:02:04,914 DEBUG SIZE: 243940
2017-03-05 13:02:05,023 DEBUG SIZE: 138384
2017-03-05 13:02:05,026 DEBUG SIZE: 1185964
2017-03-05 13:02:05,141 DEBUG SIZE: 1203715
2017-03-05 13:02:05,213 DEBUG SIZE: 291415
2017-03-05 13:02:05,213 DEBUG SIZE: 287030
2017-03-05 13:02:05,224 DEBUG SIZE: 1192165
2017-03-05 13:02:05,230 DEBUG SIZE: 1193751
2017-03-05 13:02:05,234 DEBUG SIZE: 359193
2017-03-05 13:02:05,247 DEBUG SIZE: 23703
2017-03-05 13:02:05,252 DEBUG SIZE: 24606
2017-03-05 13:02:05,275 DEBUG SIZE: 302388
2017-03-05 13:02:05,329 DEBUG SIZE: 334925
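Per-response sizes alone do not show the peak; what matters is how much HTML all 300 workers hold at the same moment. A small thread-safe counter could log that running total instead. A sketch; track_html_bytes and the logger setup are mine, not from the original script:

import logging
import threading

ram_logger = logging.getLogger('ram')  # assumed to match the logger above

_total_lock = threading.Lock()
_total_bytes = 0

def track_html_bytes(delta):
    """Adjust and log the total HTML bytes currently held by all workers."""
    global _total_bytes
    with _total_lock:
        _total_bytes += delta
        ram_logger.debug('TOTAL HTML IN RAM: {}'.format(_total_bytes))

Each worker would call track_html_bytes(len(html)) right after the download and track_html_bytes(-len(html)) once the page has been parsed and discarded.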
This is my simplified scraping method:
from multiprocessing.dummy import Pool  # assumed: thread-based pool, since the question uses multithreading

def scrape_chunk(chunk):
    pool = Pool(300)  # 300 concurrent workers
    results = pool.map(scrape_chunk_item, chunk)
    pool.close()
    pool.join()
    return results

def scrape_chunk_item(item):
    root_result = _load_root(item.get('url'))
    # parse using xpath and return
And the function that loads the HTML:
def _load_root(url):
    for i in xrange(settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS):
        try:
            headers = requests.utils.default_headers()
            headers['User-Agent'] = ua.chrome
            # bug fix: the headers built above were never passed to requests.get
            r = requests.get(url, headers=headers,
                             timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i),
                             verify=False)
            r.raise_for_status()
        except requests.Timeout:
            if i >= settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS - 1:
                tb = traceback.format_exc()
                return {'success': False, 'root': None, 'error': 'timeout', 'traceback': tb}
        except Exception:
            tb = traceback.format_exc()
            return {'success': False, 'root': None, 'error': 'unknown_error', 'traceback': tb}
        else:
            break
    r.encoding = 'utf-8'
    html = r.content
    ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
    try:
        root = etree.fromstring(html, etree.HTMLParser())
    except Exception:
        tb = traceback.format_exc()
        return {'success': False, 'root': None, 'error': 'root_error', 'traceback': tb}
    return {'success': True, 'root': root}
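If the MemoryBudget gate sketched above were adopted, the natural hook point is the tail of _load_root, where the raw HTML and the growing parse tree coexist in memory (this is where libxml2's buffer allocation fails). A hypothetical variant of just that part; budget is the MemoryBudget instance from the first sketch:

from lxml import etree  # lxml, as used in the question

def _parse_gated(r):
    # Hypothetical replacement for the tail of _load_root: hold the byte
    # budget only while the raw HTML is being parsed into a tree.
    html = r.content
    budget.acquire(len(html))
    try:
        root = etree.fromstring(html, etree.HTMLParser())
    finally:
        budget.release(len(html))
    return root

This bounds how many workers parse simultaneously; to also bound download memory, the reservation would have to happen before requests.get, for example based on the Content-Length response header when it is present.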
Do you know a safe way to do this? Something that would make the workers wait if there is a buffer overflow problem?