Node.js memory leak (async.queue and request)

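I have a very simple scraper that goes through 250 pages, allocates about 400 MB of memory, and never releases it. I don't know how to fix it; maybe someone will spot something and kindly let me know.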

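var async = require('async')  // assumed: the 'async' package named in the title
var req = require('request')  // assumed: the 'request' package named in the title

function scrape(shop, o, cb, step) {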

    var itemz = [] 

    var q = async.queue(function (o, cb) { 
     req({ 
      url: o.url 
     }, function (e, r) { 
      if (e) throw (e) 
      cb() 
      o.cb(r.body) 
     }) 
    }, o.threads) 
    var get = function (url, cb) { 
     q.push({ 
      url: url, 
      cb: cb 
     }) 
    } 

    var url = 'https://www.host.com' 
    var total, done = 0, 
     itemsPerPage = 24 

    get(url, function (r) { 

     var pages = (r.match(/data-page="(\d+)"/g)); // declared locally to avoid an implicit global
     pages = pages[pages.length - 2].split("data-page=\"")[1].split('"')[0] || 1; 
     pages = Math.min(pages, 10) // limit to 10 pages max (240 items) 

     for (var i = 1; i <= pages; i++) { 
      get(url + '&page=' + i, scrapeList) 
     } 
     total = pages + pages * itemsPerPage 
    }) 

    // - extract the transaction links from the pages: 
    // and add them to queue 
    function scrapeList(r) { 
     var itemsFound = 0 

     r.replace(/href="(https:\/\/www.host.com\/listing\/(\d+).*)"/g, function (s, itemUrl, dateSold) { 
      itemsFound++ 
      get(itemUrl, function (r) { 
       scrapeItem(r, itemUrl, dateSold) 
       step(++done, total) 
       if (done == total) onend() 
      }) 
     }) 

     total -= itemsPerPage - itemsFound // decrease expected items, if less items per page found than initially expected 
     step(++done, total) 
    } 

    // - from item page extract the details, and add to items array 
    function scrapeItem(r, itemUrl, dateSold) { 

     var d = {} 
     d.url = itemUrl; 

     d.date = new Date(Date.now()) 

     d.quantity = 1; 

     itemz.push(d) 
    } 

    // - when no more requests in a queue (on drain), group items by title 
    function onend() { 

     cb(null, itemz); 

    } 
}


2016-07-05 freddor

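How are you calling the 'scrape(...)' function? It returns an array to its callback. If you store that array persistently, it becomes a persistent set of data. – jfriend00

I store it in an array, and a setInterval iterates over that array every minute and clears it (delete cache[k]); – freddor

Have you taken a heap snapshot and inspected what is in the heap? – jfriend00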
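The heap check suggested in the comment above could look roughly like the sketch below. It is not part of the original thread: the helper names heapUsedMb and dumpHeap are made up for illustration, and it assumes a Node.js version of 11.13 or later, where v8.writeHeapSnapshot is available (older versions would need a different tool).

```javascript
var v8 = require('v8');

// Return the current V8 heap usage in megabytes, rounded to one decimal.
function heapUsedMb() {
  var bytes = process.memoryUsage().heapUsed;
  return Math.round(bytes / 1024 / 1024 * 10) / 10;
}

// Write a .heapsnapshot file into the working directory and return its name.
// The file can be loaded in the Memory tab of Chrome DevTools.
// Assumption: Node.js 11.13 or later, where v8.writeHeapSnapshot exists.
function dumpHeap() {
  return v8.writeHeapSnapshot();
}

console.log('heap used: ' + heapUsedMb() + ' MB');
// dumpHeap(); // hypothetical usage: call once before and once after the scrape
```

Comparing two snapshots, one taken before the scrape and one after it has finished, should show which objects are still being retained.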

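I had a similar problem where I was scraping a host and using cheerio to parse the HTML. cheerio uses lodash internally, which had a memory leak: the memory was never released. The workaround I found was to trigger the GC (garbage collector) periodically to free memory: just call global.gc(); at regular intervals, and run the script with the --expose-gc flag.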

For example: node --expose-gc <script>.js

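This is not an ideal solution, but it is a quick fix for a standalone script like yours (see here). Also, don't make the interval too short: I found that garbage collection is CPU-intensive and delays the event loop, so running it every 5 to 10 seconds should be enough.

I also found some material about V8 garbage collection here.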
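A minimal sketch of the periodic GC workaround described in this answer. The function name startPeriodicGc and the 5000 ms interval are illustrative choices, not taken from the answer's own code. It relies on the fact that global.gc is only defined when Node.js is started with --expose-gc, so it checks for that first.

```javascript
// Run with: node --expose-gc <script>.js
// Without the flag, global.gc is undefined, so this function does nothing.
function startPeriodicGc(intervalMs) {
  if (typeof global.gc !== 'function') {
    return null; // flag not set, nothing to schedule
  }
  var timer = setInterval(function () {
    global.gc(); // force a garbage collection pass
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for this timer
  return timer;
}

// Hypothetical usage: collect every 5 seconds while the scraper runs.
var gcTimer = startPeriodicGc(5000);
```

Calling unref() on the timer means the process can still exit once the queue has drained, instead of being held open by the GC interval.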


2016-07-06 12:35:35 AJS

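Interesting. I always thought that if memory can be freed by garbage collection (forced or not), it isn't a leak. – robertklep

@robertklep After profiling my application, I found that cheerio was creating huge arrays and objects because of its use of lodash. – AJS

@robertklep If you are interested, see the second link in my answer; it explains a bit about how V8 garbage collection works. – AJS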
