Get all instances of 404 errors with Scrapy

I have Scrapy crawling my site, finding links that return a 404 response and writing them out to a JSON file. That part works well.
However, I can't figure out how to get every instance of a broken link, because the duplicate-request filter is catching those links rather than retrying them.
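As I understand it, the scheduler drops repeat requests through its duplicate filter unless a request is created with dont_filter=True. A rule-level hook along these lines (the _mark_no_dedup name is mine; Scrapy 1.7+ passes the source response as the second argument to process_request) would re-schedule every extracted link, but on a site with link cycles it crawls forever unless paired with something like a DEPTH_LIMIT setting, so it isn't a complete answer:

from scrapy.spiders import Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

def _mark_no_dedup(request, response):
    # Rebuild the request with dont_filter=True so the scheduler's
    # duplicate filter never drops it.
    return request.replace(dont_filter=True)

rules = (
    Rule(LxmlLinkExtractor(deny=['/android/']), callback='parse_item',
         follow=True, process_request=_mark_no_dedup),
)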
Since our site has thousands of pages, with sections maintained by multiple teams, I need to be able to produce a broken-link report for each section, rather than finding a single report and doing a search-and-replace across the entire site.
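What I have in mind for the per-section split is tagging each reported item with the section of the referring page, something like this sketch (the _section_of helper is hypothetical, and the Referer header value is bytes, so it would need a .decode() before being passed in):

from urllib.parse import urlparse

def _section_of(url):
    # Hypothetical helper: treat the first path segment of the
    # referring page as the team-owned "section" of the site.
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[0] if segments else ''

Each team could then filter the JSON report on its own section value.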
Any help would be greatly appreciated.
My current crawler:
from datetime import datetime

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field


# Items for exporting to JSON
class DevelopersLinkItem(Item):
    url = Field()
    referer = Field()
    link_text = Field()
    status = Field()
    time = Field()


class DevelopersSpider(CrawlSpider):
    """Subclasses CrawlSpider to crawl the given site and parse each link to JSON."""

    # Spider name to be used when calling from the terminal
    name = "developers_prod"
    # Allow only the given host name(s)
    allowed_domains = ["example.com"]
    # Start crawling from this URL
    start_urls = ["https://example.com"]
    # Pass 404 responses through to the callback instead of dropping them
    handle_httpstatus_list = [404]
    # Rules on how to extract links from the DOM, which URLs to deny,
    # and which callback to invoke
    rules = (
        Rule(LxmlLinkExtractor(deny=['/android/']), callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        """Called for each requested page; reports any 404 response as an item."""
        if response.status == 404:
            item = DevelopersLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['link_text'] = response.meta.get('link_text')
            item['status'] = response.status
            item['time'] = datetime.now().strftime("%Y-%m-%d %H:%M")
            return item
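For completeness, the JSON file mentioned above comes from Scrapy's built-in feed export, e.g.:

scrapy crawl developers_prod -o broken_links.json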
I have tried several custom filters for the duplicate handling, but so far none of them has worked.
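The closest I got was something along these lines (a minimal sketch; the KnownBadDupeFilter class, the recheck_404 meta key, and the module path in the setting are my own names):

from scrapy.dupefilters import RFPDupeFilter

class KnownBadDupeFilter(RFPDupeFilter):
    """Sketch: dedupe as usual, but re-schedule any request the
    spider explicitly marks for re-checking."""

    def request_seen(self, request):
        if request.meta.get('recheck_404'):
            return False  # pretend this request was never seen
        return super().request_seen(request)

# settings.py (hypothetical module path):
# DUPEFILTER_CLASS = 'myproject.dupefilters.KnownBadDupeFilter'

But as far as I can tell, the spider only learns that a URL is a 404 after the response comes back, so nothing ever sets recheck_404 at link-extraction time and the duplicates are still dropped.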