Web crawler using Twisted

I am trying to create a web crawler with Python and Twisted. The problem is that at the time I call reactor.run(), I don't yet know all the links I need to fetch. The code looks like this:

def crawl(url): 
    d = getPage(url) 
    d.addCallback(handlePage) 
    reactor.run()

and handlePage looks something like this:

def handlePage(output): 
    urls = getAllUrls(output)

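So now I need to apply crawl() to each of the URLs in urls. How do I do this? Should I stop the reactor and start it again? If I am missing something obvious, please tell me.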


2012-04-18 Vignesh

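You don't want to stop the reactor. You just want to download more pages. So you need to refactor your crawl function so that it neither stops nor starts the reactor.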

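from twisted.internet import reactor
from twisted.web.client import getPage  # deprecated in newer Twisted releases

def crawl(url):
    # Schedule the download; handlePage fires once the page body arrives.
    d = getPage(url)
    d.addCallback(handlePage)

def handlePage(output):
    # getAllUrls is the asker's own link-extraction helper.
    urls = getAllUrls(output)
    for url in urls:
        crawl(url)  # queue each discovered link on the already-running reactor

crawl(url)     # url: the starting page
reactor.run()  # start the reactor once, after the first request is queued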
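
getAllUrls is not defined anywhere in the question, so here is a minimal sketch of what such a helper might look like, using only the standard library. The `base_url` parameter and the `_LinkCollector` class are assumptions added for illustration; the original helper takes only the page body.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.urls.append(urljoin(self.base_url, value))


def getAllUrls(output, base_url=""):
    """Return the links found in an HTML page, in document order.

    base_url is a hypothetical extra parameter (not in the question);
    with the default "", links are returned as written in the page.
    """
    if isinstance(output, bytes):
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        output = output.decode("utf-8", errors="replace")
    parser = _LinkCollector(base_url)
    parser.feed(output)
    return parser.urls
```

Calling `getAllUrls(output)` with a single argument still works, so it drops into handlePage unchanged; passing the page's URL as `base_url` turns relative links into absolute ones that can be handed back to crawl().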

You might want to take a look at scrapy rather than building your own from scratch.
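
If you do build your own, note that the code above fetches the same URL again every time it is linked, and it never stops the reactor. One possible way to handle both, sketched under some assumptions: keep a set of URLs already seen and a count of outstanding requests, and call a completion hook when the count drops to zero. The `Crawler` class, its `max_pages` limit, and `run_with_twisted` are hypothetical names for illustration; the Twisted wiring uses `Agent` and `readBody` rather than `getPage`, which is my substitution and not part of the original answer.

```python
class Crawler:
    """Crawl without restarting the reactor; skip URLs already seen."""

    def __init__(self, fetch, extract_urls, on_done, max_pages=50):
        self.fetch = fetch                # url -> Deferred firing with the page body
        self.extract_urls = extract_urls  # body -> iterable of absolute URLs
        self.on_done = on_done            # called when no requests are outstanding
        self.max_pages = max_pages        # assumed safety limit, not in the original
        self.seen = set()
        self.pending = 0

    def crawl(self, url):
        if url in self.seen or len(self.seen) >= self.max_pages:
            return
        self.seen.add(url)
        self.pending += 1
        d = self.fetch(url)
        d.addCallback(self._handle_page)
        d.addErrback(self._handle_error, url)
        d.addBoth(self._finished)

    def _handle_page(self, body):
        for url in self.extract_urls(body):
            self.crawl(url)

    def _handle_error(self, failure, url):
        # Report and swallow the failure so one bad page does not end the crawl.
        print("failed to crawl", url)

    def _finished(self, _):
        self.pending -= 1
        if self.pending == 0:
            self.on_done()


def run_with_twisted(start_url, extract_urls):
    # Imported here so the Crawler class itself has no hard Twisted dependency.
    from twisted.internet import reactor
    from twisted.web.client import Agent, readBody

    agent = Agent(reactor)

    def fetch(url):
        # Agent.request expects the method and URI as bytes.
        d = agent.request(b"GET", url.encode("ascii"))
        d.addCallback(readBody)  # fires with the response body as bytes
        return d

    crawler = Crawler(fetch, extract_urls, on_done=reactor.stop)
    crawler.crawl(start_url)
    reactor.run()  # still started exactly once; stopped when the crawl drains
```

The reactor is still started only once. The difference is that `reactor.stop` is passed in as the completion hook, so the process exits once every queued page has either been handled or failed.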


2012-04-18 19:42:33

Thanks, I didn't know it was that simple! – Vignesh 2012-04-19 01:21:18
