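How can I get the URLs that failed in Scrapy?

I'm new to Scrapy, and it's the most amazing crawler framework I know of!

In my project I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, so I can see some statistics but no details.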

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats: 
{'downloader/exception_count': 1, 
'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1, 
'downloader/request_bytes': 46282582, 
'downloader/request_count': 92383, 
'downloader/request_method_count/GET': 92383, 
'downloader/response_bytes': 123766459, 
'downloader/response_count': 92382, 
'downloader/response_status_count/200': 92382, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000), 
'item_scraped_count': 46191, 
'request_depth_max': 1, 
'scheduler/memory_enqueued': 92383, 
'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report? For example, one that shows the failed URLs. Thanks!

Source

2012-12-05 Joe Wu

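Yes, this is possible.

I added a failed_urls list to my spider class and appended the URL to it whenever the response status was 404 (this would need to be extended to cover other error statuses).

Then I added a handler that joins the list into a single string and adds it to the stats when the spider is closed.

Based on your comments, it is also possible to track Twisted errors.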

from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(BaseSpider):
    handle_httpstatus_list = [404]
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        self.failed_urls = []

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(spider, reason):
        self.crawler.stats.set_value('failed_urls', ','.join(spider.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

    dispatcher.connect(handle_spider_closed, signals.spider_closed)

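Output (the downloader/exception_count* stats only appear if exceptions are actually thrown; I simulated them by trying to run the spider after I had turned off my wireless adapter):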

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats: 
    {'downloader/exception_count': 15, 
    'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15, 
    'downloader/request_bytes': 717, 
    'downloader/request_count': 3, 
    'downloader/request_method_count/GET': 3, 
    'downloader/response_bytes': 15209, 
    'downloader/response_count': 3, 
    'downloader/response_status_count/200': 1, 
    'downloader/response_status_count/404': 2, 
    'failed_url_count': 2, 
    'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html' 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000), 
    'log_count/DEBUG': 9, 
    'log_count/ERROR': 2, 
    'log_count/INFO': 4, 
    'response_received_count': 3, 
    'scheduler/dequeued': 3, 
    'scheduler/dequeued/memory': 3, 
    'scheduler/enqueued': 3, 
    'scheduler/enqueued/memory': 3, 
    'spider_exceptions/NameError': 2, 
    'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}

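Source

2012-12-10 11:22:28 Talvalin

This no longer works. An 'exceptions.NameError: global name 'self' is not defined' error occurs. 'BaseSpider' is now just 'Spider' (http://doc.scrapy.org/en/0.24/news.html?highlight=basespider#id2, https://github.com/scrapy/dirbot/blob/master/dirbot/spiders/dmoz.py), but I can't find the fix to get your code working yet, @Talvalin. – Mikeumus

Here is another example of how to handle and collect 404 errors (checking the GitHub help pages):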

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Just run scrapy runspider with -o output.json and look at the list of items in the output.json file.
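Once output.json exists, the broken links can be summarised with a few lines of plain Python. This is only a minimal sketch: it assumes the export is a single JSON array whose items carry the url, referer and status fields defined in GitHubLinkItem above, and the helper name broken_links_by_referer is made up for illustration.

```python
import json
from collections import defaultdict


def broken_links_by_referer(path):
    """Group the URLs of exported 404 items by the page that linked to them."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # assumed shape: one JSON array of item dicts

    grouped = defaultdict(list)
    for item in items:
        if item.get("status") == 404:
            # A start URL has no referer, so fall back to a placeholder.
            referer = item.get("referer") or "(no referer)"
            grouped[referer].append(item["url"])
    return dict(grouped)
```

Calling broken_links_by_referer("output.json") then gives a dict that maps each referring page to the dead links found on it, which is usually the information you need to fix them.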

Source

2013-01-29 22:49:49 alecxe

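The answers from @Talvalin and @alecxe helped me a great deal, but they do not seem to capture downloader events that do not generate a response object (for instance, twisted.internet.error.TimeoutError). These errors show up in the stats dump at the end of the run, but without any meta information.

As I found out, these errors are tracked by the stats.py downloader middleware. They are caught in the process_exception method of the DownloaderStats class, specifically in the ex_class variable, which increments a counter for each error type; the counts are then dumped at the end of the run.

To match such errors with information from the corresponding request object, you can add meta information to each request (via request.meta) and then pull it into the process_exception method of stats.py: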

self.stats.set_value('downloader/my_errs/%s' % request.meta, ex_class)

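This generates a unique string for each such error. You can then save the altered stats.py as mystats.py, add it to the downloader middlewares (with the right priority), and disable the stock stats.py: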

DOWNLOADER_MIDDLEWARES = {
    'myproject.mystats.MyDownloaderStats': 850,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': None,
}

The output at the end of the run looks like this (here the meta information maps each request URL to a string built from an integer group ID and member ID, like '0/14'):

{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web.http.PotentialDataLoss': 3,
'downloader/my_errs/0/1': 'twisted.web.http.PotentialDataLoss',
'downloader/my_errs/0/38': 'twisted.web.http.PotentialDataLoss',
'downloader/my_errs/0/86': 'twisted.web.http.PotentialDataLoss',
'downloader/request_bytes': 47583,
'downloader/request_count': 133,
'downloader/request_method_count/GET': 133,
'downloader/response_bytes': 3416996,
'downloader/response_count': 130,
'downloader/response_status_count/200': 95,
'downloader/response_status_count/301': 24,
'downloader/response_status_count/302': 8,
'downloader/response_status_count/500': 3,
'finish_reason': 'finished'....}

This answer deals with non-downloader-based errors.
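If you would rather not copy and edit stats.py, a variation on the same idea is a small extra downloader middleware that only records the per-request error and leaves the stock DownloaderStats enabled. The following is a sketch under assumptions, not the code from this answer: the class name FailedRequestStats and the meta key error_key are hypothetical, and the request URL is used when that key is absent.

```python
class FailedRequestStats:
    """Downloader middleware sketch: record which request raised which exception."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares through this hook; crawler.stats is the stats collector.
        return cls(crawler.stats)

    def process_exception(self, request, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__,
                              exception.__class__.__name__)
        # 'error_key' is a hypothetical meta field set by the spider, e.g. '0/14'.
        key = request.meta.get("error_key", request.url)
        self.stats.set_value("downloader/my_errs/%s" % key, ex_class)
        return None  # do not swallow the exception; let normal handling continue
```

It would be enabled through DOWNLOADER_MIDDLEWARES under whatever module path you save it in, without setting the stock stats middleware to None.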

Source

2013-08-26 21:52:47 bahmait

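Exactly what I was looking for. I think Scrapy should add this feature to provide convenient access to failure information such as URLs. – wlnirvana

Use 'scrapy.downloadermiddlewares.stats' in the latest version (1.0.5) instead of the deprecated 'scrapy.contrib.downloadermiddleware.stats'. –

@ElRuso Thanks, I have updated the answer. – bahmait

As of Scrapy 0.24.6, the method suggested by alecxe does not catch errors for the start URLs. To record errors for the start URLs, you need to override parse_start_url. Adapting alecxe's answer for this purpose, you get: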

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field

class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()

class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.handle_response(response)

    def parse_item(self, response):
        return self.handle_response(response)

    def handle_response(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Source

2015-05-29 11:08:13 Louis

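This is an update on this question. I ran into a similar problem and needed to use Scrapy signals to call a function in my pipeline. I edited @Talvalin's code, but wanted to post an answer to make things a bit clearer.

Basically, you should add self as an argument to handle_spider_closed. You should also call the dispatcher in __init__ so that you can pass the spider instance (self) to the handler method.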

from scrapy.spider import Spider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(Spider):
    handle_httpstatus_list = [404]
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        self.failed_urls = []
        # the dispatcher is now called in init
        dispatcher.connect(self.handle_spider_closed, signals.spider_closed)

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, spider, reason):  # added self
        self.crawler.stats.set_value('failed_urls', ','.join(spider.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

I hope this helps anyone who runs into the same problem in the future.
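On Scrapy versions where scrapy.xlib.pydispatch is no longer available, one way to get the same effect without wiring up a signal by hand is the spider's closed(reason) method, which Scrapy calls when the spider finishes. The sketch below is an assumption-laden variant of the code above, not a drop-in replacement: FailedUrlTracker and record_if_failed are hypothetical names, and the class is written as a mixin so that it carries no Scrapy import of its own.

```python
class FailedUrlTracker:
    """Mixin sketch; meant to be combined with scrapy.Spider (an assumption)."""

    handle_httpstatus_list = [404]

    def record_if_failed(self, response):
        """Remember the URL of a handled error response; return True if recorded."""
        if response.status not in self.handle_httpstatus_list:
            return False
        # Create the list lazily so that no custom __init__ is needed.
        if not hasattr(self, "failed_urls"):
            self.failed_urls = []
        self.failed_urls.append(response.url)
        self.crawler.stats.inc_value("failed_url_count")
        return True

    def closed(self, reason):
        # Called by Scrapy when the spider closes; publish the collected URLs as a stat.
        failed = getattr(self, "failed_urls", [])
        self.crawler.stats.set_value("failed_urls", ",".join(failed))
```

A spider would then be declared as class MySpider(FailedUrlTracker, scrapy.Spider) and call self.record_if_failed(response) at the top of parse.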

Source

2015-07-06 09:44:48 Mattias

By default, Scrapy ignores 404 responses and does not parse them. If you are getting a 404 error code in a response, you can handle it in a very simple way. In the settings, write:

HTTPERROR_ALLOWED_CODES = [404,403]

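and then handle the response status code in your parse function:

def parse(self, response):
    if response.status == 404:
        # your action on error
        pass

In short: allow the status codes in the settings, then check the response status in the parse function.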

Source

2015-11-28 13:45:32

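In addition to some of these answers, if you want to track Twisted errors, I would look at the errback parameter of the Request object. It lets you set a callback function that is called with the Twisted Failure when a request fails. Besides the URL, this approach also lets you track the type of failure.

You can then log the URL via failure.request.url (where failure is the Twisted Failure object passed to the errback).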

import logging

import scrapy

# these would be in a Spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse,
                             errback=self.handle_error)

def handle_error(self, failure):
    url = failure.request.url
    logging.error('Failure type: %s, URL: %s', failure.type, url)

The Scrapy docs give a full example of how this can be done, except that the calls to the Scrapy logger are deprecated, so I have adapted my example to use Python's built-in logging:

https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks
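If you want the failed URLs as data rather than only as log lines, the errback can also collect them. This is a rough sketch that relies on the same two attributes used above (failure.request.url and failure.type); the FailureCollector class and its urls_by_type attribute are names invented for this example.

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)


class FailureCollector:
    """Collect failed request URLs, grouped by the name of the failure type."""

    def __init__(self):
        self.urls_by_type = defaultdict(list)

    def errback(self, failure):
        # The failure passed to an errback carries the request that failed.
        url = failure.request.url
        type_name = failure.type.__name__
        self.urls_by_type[type_name].append(url)
        logger.error("Failure type: %s, URL: %s", type_name, url)
```

A spider could keep one collector instance, pass collector.errback as the errback of each Request, and write urls_by_type to the stats or to a file when it closes.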

Source

2017-08-31 23:37:24
