Scrapy crawler does not follow links

I'm new to Scrapy. Here is my spider for crawling twistedweb:

class TwistedWebSpider(BaseSpider):

    name = "twistedweb3"
    allowed_domains = ["twistedmatrix.com"]
    start_urls = [
        "http://twistedmatrix.com/documents/current/web/howto/",
    ]
    rules = (
        Rule(SgmlLinkExtractor(),
             'parse',
             follow=True,
        ),
    )

    def parse(self, response):
        print response.url
        filename = response.url.split("/")[-1]
        filename = filename or "index.html"
        open(filename, 'wb').write(response.body)

When I run scrapy-ctl.py crawl twistedweb3, it only fetches the index.html content. I tried SgmlLinkExtractor on its own and it extracts the links as I expected, but the spider does not follow them.

Can you tell me where I am going wrong?

Suppose I also want to get the CSS and JavaScript files. How do I achieve that? In other words, how do I fetch the complete website?

2010-08-19 Iapilgrim

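You haven't shown enough code here to even guess what your problem is. I suggest you work through the Scrapy tutorial; after that, either your question will answer itself or you will be able to explain what the problem is. http://doc.scrapy.org/intro/tutorial.html – msw 2010-08-19 02:49:15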

I did follow the tutorial. I showed part of my spider above. – Iapilgrim 2010-08-20 06:08:02

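The rules attribute belongs to CrawlSpider, so use class MySpider(CrawlSpider). Also, when you use CrawlSpider you must not override the parse method, because CrawlSpider uses parse itself to apply the rules. Name your callback parse_response or something similar instead.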

2010-08-19 04:58:53 Rolando

Thanks, Rho. You saved my day. It works after making the changes you suggested. – Iapilgrim 2010-08-20 06:11:19
