Memory allocation failed : growing buffer - Python

I am working on a script that scrapes thousands of different web pages. Since the pages are usually different (different websites), I use multithreading to speed up scraping.

EDIT: simple short explanation

-------

I load 300 URLs (their HTML) at a time in a pool of 300 workers. Since the HTML sizes vary, the sum of the sizes is sometimes too large and Python raises: internal buffer error : Memory allocation failed : growing buffer. I want some way to check whether this is about to happen and, if so, make the workers wait until the buffer is no longer full.

-------

This approach works, but sometimes Python starts throwing:

internal buffer error : Memory allocation failed : growing buffer 
internal buffer error : Memory allocation failed : growing buffer 
internal buffer error : Memory allocation failed : growing buffer 
internal buffer error : Memory allocation failed : growing buffer 
internal buffer internal buffer error : Memory allocation failed : growing buffer 
internal buffer error : Memory allocation failed : growing buffer 
error : Memory allocation failed : growing buffer 
internal buffer error : Memory allocation failed : growing buffer 
internal buffer error : Memory allocation failed : growing buffer 
internal buffer error : Memory allocation failed : growing buffer 

to the console. I think this happens because of the HTML I keep in memory, which can add up to 300 * (e.g. 1 MB) = 300 MB.

EDIT:

I know I can reduce the number of workers, and I will. But that is not a real solution; it only lowers the chance of hitting this error. I want to avoid the error entirely...

I started logging the HTML sizes:

ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html))) 

The results look like this (a sample):

2017-03-05 13:02:04,914 DEBUG SIZE: 243940 
2017-03-05 13:02:05,023 DEBUG SIZE: 138384 
2017-03-05 13:02:05,026 DEBUG SIZE: 1185964 
2017-03-05 13:02:05,141 DEBUG SIZE: 1203715 
2017-03-05 13:02:05,213 DEBUG SIZE: 291415 
2017-03-05 13:02:05,213 DEBUG SIZE: 287030 
2017-03-05 13:02:05,224 DEBUG SIZE: 1192165 
2017-03-05 13:02:05,230 DEBUG SIZE: 1193751 
2017-03-05 13:02:05,234 DEBUG SIZE: 359193 
2017-03-05 13:02:05,247 DEBUG SIZE: 23703 
2017-03-05 13:02:05,252 DEBUG SIZE: 24606 
2017-03-05 13:02:05,275 DEBUG SIZE: 302388 
2017-03-05 13:02:05,329 DEBUG SIZE: 334925 
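
The ram_logger used for these lines is not shown in the question; a minimal setup that would produce this timestamped DEBUG format might look like the following (the format string and logger name are assumptions, not code from the original script):

import logging

logging.basicConfig(format='%(asctime)s %(levelname)s %(message)s', level=logging.DEBUG)
ram_logger = logging.getLogger('ram_logger')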

This is my simplified scraping method:

from multiprocessing.dummy import Pool  # assumption: a thread-based Pool, since the question uses multithreading

def scrape_chunk(chunk):
    pool = Pool(300)  # one worker per page in the chunk
    results = pool.map(scrape_chunk_item, chunk)
    pool.close()
    pool.join()
    return results

def scrape_chunk_item(item): 
    root_result = _load_root(item.get('url')) 
    # parse using xpath and return 

And the function that loads the HTML:

def _load_root(url):
    for i in xrange(settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS):
        try:
            headers = requests.utils.default_headers()
            headers['User-Agent'] = ua.chrome
            r = requests.get(url, headers=headers, timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i), verify=False)
            r.raise_for_status()
        except requests.Timeout:
            if i >= settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS - 1:
                tb = traceback.format_exc()
                return {'success': False, 'root': None, 'error': 'timeout', 'traceback': tb}
        except Exception:
            tb = traceback.format_exc()
            return {'success': False, 'root': None, 'error': 'unknown_error', 'traceback': tb}
        else:
            break

    r.encoding = 'utf-8'
    html = r.content
    ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
    try:
        root = etree.fromstring(html, etree.HTMLParser())
    except Exception:
        tb = traceback.format_exc()
        return {'success': False, 'root': None, 'error': 'root_error', 'traceback': tb}

    return {'success': True, 'root': root}
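
For context, each chunk passed to scrape_chunk is a list of dicts with a 'url' key (see item.get('url') above). A hypothetical driver that feeds a flat URL list through the pool in 300-item chunks, shown purely as an illustration, could look like:

def scrape_all(urls, chunk_size=300):
    # Hypothetical helper, not part of the original script: walk the URL
    # list in fixed-size chunks so only one chunk of pages is in flight at a time.
    results = []
    for start in range(0, len(urls), chunk_size):
        chunk = [{'url': u} for u in urls[start:start + chunk_size]]
        results.extend(scrape_chunk(chunk))
    return results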

Do you know how to make this safe? Something that would make the workers wait if the buffer is about to overflow?

1 Answer

You could limit each worker so it only starts if there is X memory available for it... untested:

import contextlib
import threading
import time

lock = threading.Lock()
total_mem = 1024 * 1024 * 500  # 500 MB budget of spare memory

@contextlib.contextmanager
def ensure_memory(size):
    global total_mem
    while True:
        with lock:
            if total_mem > size:
                total_mem -= size
                break
        time.sleep(1)  # or some smarter back-off
    try:
        yield
    finally:
        # return the reserved memory to the budget even if the block raises
        with lock:
            total_mem += size

def _load_root(url):
    ...
    # stream=True defers downloading the body until r.content is accessed,
    # so memory can be reserved based on the Content-Length header first
    r = requests.get(url, headers=headers, timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i), verify=False, stream=True)
    ...
    # Content-Length is a string, so cast it before comparing against total_mem
    with ensure_memory(int(r.headers['content-length'])):
        # now do stuff here :)
        html = r.content
        ...
        return {'success': True, 'root': root}
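
One caveat: some servers omit the Content-Length header entirely (for example with chunked transfer encoding), so a robust version would fall back to a conservative size estimate when the header is missing.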

total_mem could also be computed automatically, so you would not have to guess the right value for each machine...
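
For instance, a minimal sketch of that idea, assuming the third-party psutil package is available (psutil and the 50% safety factor are additions for illustration, not part of the original answer):

import psutil

# Reserve at most half of the memory currently available on the machine,
# leaving headroom for the interpreter, parsed lxml trees, etc.
total_mem = psutil.virtual_memory().available // 2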