I'm not very familiar with the internals of Node.js, but as far as I know, you get the 'Maximum call stack size exceeded' error when you make too many function calls. Can a large amount of data in Node.js exceed the stack size?
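As far as I understand it, the usual way to hit that error is unbounded recursion, e.g.:

function recurse() {
    recurse(); // no base case, so every call adds another stack frame
}
recurse(); // throws RangeError: Maximum call stack size exceeded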
I'm building a spider that follows links, and I start getting these errors after a random number of fetched URLs. When it happens, Node doesn't give you a stack trace, but I'm fairly sure I don't have any recursion bugs.
I use request to fetch URLs, and cheerio to parse the fetched HTML and find new links. The stack overflow always happens inside cheerio. When I swapped cheerio out for htmlparser2, the errors went away. Htmlparser2 is much lighter, since it just emits an event for each open tag instead of parsing the whole document and building a tree.
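For reference, the event-driven approach with htmlparser2 looks roughly like this (a simplified sketch, just collecting hrefs from anchor tags):

var htmlparser = require('htmlparser2');

function extractLinks(html) {
    var links = [];
    var parser = new htmlparser.Parser({
        // htmlparser2 emits an event per open tag instead of building a whole tree
        onopentag: function(name, attribs) {
            if(name === 'a' && attribs.href) links.push(attribs.href);
        }
    });
    parser.write(html);
    parser.end();
    return links;
}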
My theory is that cheerio is eating up the stack, but I'm not sure that's even possible?
Here is a simplified version of my code (it's only for reading, it won't run as-is):
var _ = require('underscore');
var fs = require('fs');
var urllib = require('url');
var request = require('request');
var cheerio = require('cheerio');

var mongo = "This is a global connection to mongodb.";
var maxConc = 7;

var crawler = {
    concurrent: 0,
    queue: [],
    fetched: {},

    // Fetch one url and hand the response body to extract()
    fetch: function(url) {
        var self = this;
        self.concurrent += 1;
        self.fetched[url] = 0;
        request.get(url, { timeout: 10000, pool: { maxSockets: maxConc } }, function(err, response, body){
            self.concurrent -= 1;
            self.fetched[url] = 1;
            self.extract(url, body);
        });
    },

    // Store the page, collect new links and queue them
    extract: function(referrer, data) {
        var self = this;
        var urls = [];
        mongo.pages.insert({ _id: referrer, html: data, time: +(new Date) });
        /**
         * THE ERROR HAPPENS HERE, AFTER A RANDOM NUMBER OF FETCHED PAGES
         **/
        cheerio.load(data)('a').each(function(){
            var href = resolve(this.attribs.href, referrer); // resolves relative urls, not important
            // Save the href only if it hasn't been fetched, it's not already in the queue and it's not already on this page
            if(href && !_.has(self.fetched, href) && !_.contains(self.queue, href) && !_.contains(urls, href))
                urls.push(href);
        });
        // Check the database to see if we already visited some urls.
        mongo.pages.find({ _id: { $in: urls } }, { _id: 1 }).toArray(function(err, results){
            if(err) results = [];
            else results = _.pluck(results, '_id');
            urls = urls.filter(function(url){ return !_.contains(results, url); });
            self.push(urls);
        });
    },

    // Add new urls to the queue and start fetches while under the concurrency limit
    push: function(urls) {
        Array.prototype.push.apply(this.queue, urls);
        var url, self = this;
        // Check the limit before shifting so a queued url isn't dropped when saturated
        while(this.concurrent < maxConc && (url = self.queue.shift())) {
            self.fetch(url);
        }
    }
};

crawler.fetch('http://some.test.url.com/');
I'm getting the same error with cheerio.. did you ever find out why? – Lloyd
Unfortunately no. For this project just using htmlparser2 was enough, and the error doesn't happen with it. – disc0dancer
Ok.. in the end I had to manipulate the HTML text manually: before passing it to cheerio I parsed it and stripped out all the markup I didn't care about. – Lloyd
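A minimal sketch of that kind of pre-processing, assuming the goal is just to drop heavy blocks such as <script> and <style> before cheerio builds its tree (the exact tags to strip are an assumption):

var cheerio = require('cheerio');

// Hypothetical helper: drop markup we don't care about before parsing,
// so cheerio has far less input to turn into a tree.
function stripUnwantedMarkup(html) {
    return html
        .replace(/<script[\s\S]*?<\/script>/gi, '')
        .replace(/<style[\s\S]*?<\/style>/gi, '');
}

var rawHtml = '<html><body><script>var x = 1;</script><a href="/next">next</a></body></html>';
var $ = cheerio.load(stripUnwantedMarkup(rawHtml));
console.log($('a').attr('href')); // "/next"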