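Memory leak when web scraping
I'm trying to index all the movies, series, etc. on this website: www.newpct1.com. For each media item I want to save its title, torrent file URL and file size. To do this I'm using NodeJS with the modules cheerio (to extract content from HTML with a jQuery-like syntax) and request (to make the requests). The code is as follows: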
const cheerio = require('cheerio');
const request = require('request');

console.log('\"Site\",\"Title\",\"Size\",\"URL\"');

const baseURL = 'http://newpct1.com/';
const sites = ['documentales/pg/', 'peliculas/pg/', 'series/pg/', 'varios/pg/'];

for (let i = 0; i < sites.length; i++) {
  let site = sites[i].split('/')[0];
  for (let j = 1; true; j++) { // Infinite loop
    let siteURL = baseURL + sites[i] + j;
    // getMediaURLs
    // -------------------------------------------------------------------------
    request(siteURL, (err, resp, body) => {
      if (!err) {
        let $ = cheerio.load(body);
        let lis = $('li', 'ul.pelilist');
        // If exists media
        if (lis.length) {
          $('a', lis).each((k, elem) => {
            let mediaURL = $(elem).attr('href');
            // getMediaAttrs
            //------------------------------------------------------------------
            request(mediaURL, (err, resp, body) => {
              if (!err) {
                let $ = cheerio.load(body);
                let title = $('strong', 'h1').text();
                let size = $('.imp').eq(1).text().split(':')[1];
                let torrent = $('a.btn-torrent').attr('href');
                console.log('\"%s\",\"%s\",\"%s\",\"%s\"', site, title, size,
                  torrent);
              }
            });
            //------------------------------------------------------------------
          });
        }
      }
    });
    // -------------------------------------------------------------------------
  }
}
The problem with this code is that it never finishes executing and eventually fails with this error (memory leak):
<--- Last few GCs --->
22242 ms: Mark-sweep 1372.4 (1439.0) -> 1370.7 (1439.0) MB, 1088.7/0.0 ms [allocation failure] [GC in old space requested].
23345 ms: Mark-sweep 1370.7 (1439.0) -> 1370.7 (1439.0) MB, 1103.0/0.0 ms [allocation failure] [GC in old space requested].
24447 ms: Mark-sweep 1370.7 (1439.0) -> 1370.6 (1418.0) MB, 1102.1/0.0 ms [last resort gc].
25527 ms: Mark-sweep 1370.6 (1418.0) -> 1370.6 (1418.0) MB, 1079.5/0.0 ms [last resort gc].
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x272c0e23fa99 <JS Object>
1: httpify [/home/marco/node_modules/caseless/index.js:~50] [pc=0x3f51b4a2c2c5] (this=0x1e65c39fbdb9 <JS Function module.exports (SharedFunctionInfo 0x1e65c39fb581)>,resp=0x2906174cf6a9 <a Request with map 0x2efe262dbef9>,headers=0x11e0242443f1 <an Object with map 0x2efe26206829>)
2: init [/home/marco/node_modules/request/request.js:~144] [pc=0x3f51b4a3ee1d] (this=0x2906174cf6a9 <a Requ...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node::Abort() [node]
2: 0x10d3f9c [node]
3: v8::Utils::ReportApiFailure(char const*, char const*) [node]
4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
5: v8::internal::Handle<v8::internal::JSFunction> v8::internal::Factory::New<v8::internal::JSFunction>(v8::internal::Handle<v8::internal::Map>, v8::internal::AllocationSpace) [node]
6: v8::internal::Factory::NewFunction(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Handle<v8::internal::Context>, v8::internal::PretenureFlag) [node]
7: v8::internal::Factory::NewFunctionFromSharedFunctionInfo(v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Handle<v8::internal::Context>, v8::internal::PretenureFlag) [node]
8: v8::internal::Runtime_NewClosure_Tenured(int, v8::internal::Object**, v8::internal::Isolate*) [node]
9: 0x3f51b47060c7
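I tried running it on a machine with more RAM (16 GB), but it fails with the same error.
I also took a heap snapshot, but I can't see where the problem is. The snapshot is here: https://drive.google.com/open?id=0B5Ysugq64wdLSHdHVHctUXZaNGM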
I believe you are making an infinite number of requests there, and each request takes up space – juvian
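It looks like it. If I break out of the two loops after the first iteration, it works. I don't know how to limit these requests, or how to make a few of them, wait for them to finish, and then continue. –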
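One possible way to "make a few requests, wait for them to finish, then continue" is to await each list page before asking for the next one, and to stop at the first empty page. The following is only a sketch under assumptions: crawlSection, fetchList, fetchItem and limit are hypothetical names that do not appear in the question, and fetchList/fetchItem are assumed to be caller-supplied functions returning Promises (for example thin wrappers around request and cheerio).

```javascript
// Hypothetical helper (not from the question): crawl one section page by page.
// fetchList(url) is assumed to resolve to an array of media URLs on that page;
// fetchItem(url) is assumed to resolve to the scraped attributes of one item.
async function crawlSection(sectionURL, fetchList, fetchItem, limit = 5) {
  const results = [];
  for (let page = 1; ; page++) {
    // Awaiting here yields to the event loop, so only one list request
    // is in flight at a time instead of an unbounded number.
    const mediaURLs = await fetchList(sectionURL + page);
    if (mediaURLs.length === 0) break; // empty page: nothing left to crawl

    // Fetch the detail pages in batches of at most `limit` parallel requests.
    for (let i = 0; i < mediaURLs.length; i += limit) {
      const batch = mediaURLs.slice(i, i + limit);
      const items = await Promise.all(batch.map((url) => fetchItem(url)));
      results.push(...items);
    }
  }
  return results;
}
```

With this shape the loop has an exit condition (the first page with no items), and memory use is bounded by one list page plus `limit` detail requests at a time, rather than growing with every iteration.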
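In my mind, console.log(...) should be executing all along, but that is not the case. I believe it only starts printing to the command line once all the requests have been made. –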
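A likely explanation for the missing output is that Node.js runs callbacks only after the currently executing synchronous code returns, so a loop that never ends never lets any request callback (and therefore any console.log inside it) run. The sketch below illustrates this with a bounded loop; scheduleInLoop is a hypothetical name, and setImmediate stands in for request(), which is an assumption made to keep the example free of network access.

```javascript
// Hypothetical demo: queue `n` asynchronous callbacks from a synchronous loop
// and record how many had run by the time the loop finished.
function scheduleInLoop(n) {
  let ran = 0;
  for (let i = 0; i < n; i++) {
    // Stand-in for request(): the callback is queued, not executed here.
    setImmediate(() => { ran++; });
  }
  // No callback can have run yet, because the loop never yielded.
  const ranBeforeLoopEnded = ran;
  return {
    ranBeforeLoopEnded,
    // Queued after the callbacks above, so it resolves once they have run.
    ranAfterYield: new Promise((resolve) => setImmediate(() => resolve(ran))),
  };
}
```

In the original code the loop over j has no exit, so the equivalent of ranBeforeLoopEnded is the only state the program ever reaches: pending requests keep accumulating on the heap while none of their callbacks get a chance to execute.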