接下來的一句話引起了我的眼睛在Wget的手冊Web蜘蛛與Wget的蜘蛛有什麼不同?
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
我找到相關的代碼在wget的蜘蛛選項以下行。
src/ftp.c
780: /* If we're in spider mode, don't really retrieve anything. The
784: if (opt.spider)
889: if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227: if (!opt.spider)
1239: if (!opt.spider)
1268: else if (!opt.spider)
1827: if (opt.htmlify && !opt.spider)
src/http.c
64:#include "spider.h"
2405: /* Skip preliminary HEAD request if we're not in spider mode AND
2407: if (!opt.spider
2428: if (opt.spider && !got_head)
2456: /* Default document type is empty. However, if spider mode is
2570: * spider mode. */
2571: else if (opt.spider)
2661: if (opt.spider)
src/res.c
543: int saved_sp_val = opt.spider;
548: opt.spider = false;
551: opt.spider = saved_sp_val;
src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)
src/spider.h
1:/* Declarations for spider.c
src/recur.c
52:#include "spider.h"
279: if (opt.spider)
366: || opt.spider /* opt.recursive is implicitely true */
370: (otherwise unneeded because of --spider or rejected by -R)
375: (opt.spider ? "--spider" :
378: (opt.delete_after || opt.spider
440: if (opt.spider)
src/options.h
62: bool spider; /* Is Wget in spider mode? */
src/init.c
238: { "spider", &opt.spider, cmd_boolean },
src/main.c
56:#include "spider.h"
238: { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435: --spider don't download anything.\n"),
1045: if (opt.recursive && opt.spider)
我想看看代碼中的差異,而不是抽象的。我喜歡代碼示例。
怎麼辦網絡蜘蛛不同於Wget的在碼蜘蛛?
寫得很好,並說明是一個真正的網絡蜘蛛的挑戰,但我認爲這個問題是指爬行「一個」網站,而不是整個互聯網,可以做得比網絡蜘蛛更好,沒有太多的努力。 – 2009-05-13 19:59:29