2017-10-11

Get all URLs in an entire website using Scrapy

Hey guys! I'm trying to get all of a website's internal URLs for SEO purposes, and I recently discovered Scrapy to help me with this task. But my code always returns an error:

2017-10-11 10:32:00 [scrapy.core.engine] INFO: Spider opened 
2017-10-11 10:32:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min 
) 
2017-10-11 10:32:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-10-11 10:32:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from 
<GET http://www.**test**.com/robots.txt> 
2017-10-11 10:32:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None) 
2017-10-11 10:32:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from 
<GET http://www.**test**.com> 
2017-10-11 10:32:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None) 
2017-10-11 10:32:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.**test**.com/> (referer: None) 
Traceback (most recent call last): 
    File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "c:\python27\lib\site-packages\scrapy\spiders\__init__.py", line 90, in parse 
    raise NotImplementedError 
NotImplementedError 

I changed the original URL.

Here is the code I ran:

# -*- coding: utf-8 -*- 
import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 


class TestSpider(scrapy.Spider): 
    name = "test" 
    allowed_domains = ["http://www.test.com"] 
    start_urls = ["http://www.test.com"] 

    rules = [Rule (LinkExtractor(allow=['.*']))] 

Thanks!

EDIT:

This worked for me:

rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True), 
) 

def parse_item(self, response): 
    filename = response.url 
    arquivo = open("file.txt", "a") 
    string = str(filename) 
    arquivo.write(string+ '\n') 
    arquivo.close()
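The same callback can be sketched with a context manager, so the file is closed automatically even if the write raises. This is just an alternative sketch of the snippet above; `response` stands in for Scrapy's response object:

```python
# Sketch: append each visited URL to file.txt using a with-block,
# which closes the file automatically when the block exits.
def parse_item(self, response):
    with open("file.txt", "a") as arquivo:
        arquivo.write(response.url + "\n")
```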

=D


Welcome to SO! I suggest you post your solution as an answer. That will help future readers better understand both the problem and the answer. – Nisarg

Answer


The error you're getting is caused by the fact that you haven't defined a parse method in your spider, which is mandatory if you base your spider on the scrapy.Spider class.

For your purpose (i.e. crawling the whole website), it's better to base your spider on the scrapy.CrawlSpider class. Also, in the Rule, you have to define the callback attribute as the method that will parse each page you visit. One last cosmetic change: since you want to visit every page, in the LinkExtractor you can omit allow, because its default value is an empty tuple, which means it will match all links found.

For concrete code, refer to the CrawlSpider example.