How to raise twisted.internet.error.TimeoutError directly without sending a request

I have 100,000 URLs in a database that I scrape using a spider. For example, 100 of the URLs may share the same domain:

http://notsame.com/1 
http://notsame2.com/1 
http://dom.com/1 
http://dom.com/2 
http://dom.com/3 
...

The problem is that sometimes a page/domain does not respond, so I get <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure:. The same happens for every URL of that domain.

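I want to detect timeouts for, say, 5 URLs of the same domain. Then, once I am sure that there is some problem with this host, I want to stop requesting this domain and directly raise <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure: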

Is it possible? If so, how?

EDIT:

My idea (edited with rrschmidt's help):

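from urllib.parse import urlparse

from twisted.internet.error import TimeoutError


class TimeoutProcessMiddleware:
    # domains that already timed out; shared by all instances
    _timeouted_domains = set()

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        if domain in self._timeouted_domains:
            # an exception has to be raised, not returned
            raise TimeoutError()
        return None  # returning the request itself would reschedule it

    def process_exception(self, request, exception, spider):
        # left out the code for counting timeouts for clarity
        if isinstance(exception, TimeoutError):
            self._timeouted_domains.add(urlparse(request.url).netloc)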

Source

2017-05-22 Milano Slesarik

You are on the right track with your idea of building a TimeoutProcessMiddleware. More specifically, I would build it as a downloader middleware.

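A downloader middleware can touch every outgoing request and every incoming response, and it can also handle every exception that pops up while a request/response is being processed. Details: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html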
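A downloader middleware only takes effect once it is enabled in the project settings. A minimal sketch, assuming the class lives in a hypothetical module myproject/middlewares.py; the priority value 543 is only illustrative and may need adjusting relative to the other middlewares you have enabled:

```python
# settings.py
# NOTE: the module path "myproject.middlewares" is a hypothetical example.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.TimeoutProcessMiddleware": 543,
}
```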

So here is what I would do (untested, it may need some refinement and fine-tuning):

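from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest
from twisted.internet.error import TimeoutError


class TimeoutProcessMiddleware:
    # domains that already timed out; shared by all instances
    _timeouted_domains = set()

    def process_request(self, request, spider):
        # skip the download if this domain is known to time out
        if urlparse(request.url).netloc in self._timeouted_domains:
            raise IgnoreRequest()
        return None  # let Scrapy continue processing the request

    def process_exception(self, request, exception, spider):
        # left out the code for counting timeouts for clarity
        if isinstance(exception, TimeoutError):
            self._timeouted_domains.add(urlparse(request.url).netloc)
        return None  # leave the exception to other handlers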
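The counting that the code above leaves out could look roughly like the sketch below. It is plain Python with no Scrapy dependency; the class name DomainTimeoutTracker, its method names, and the default threshold of 5 are assumptions made for illustration (the threshold mirrors the "5 URLs" from the question).

```python
from collections import Counter
from urllib.parse import urlparse


class DomainTimeoutTracker:
    """Counts timeouts per domain and flags a domain after a threshold.

    Hypothetical helper, not part of Scrapy.
    """

    def __init__(self, max_timeouts=5):
        self.max_timeouts = max_timeouts
        self._timeouts = Counter()

    @staticmethod
    def get_domain(url):
        # netloc is the host part of the URL (plus the port, if one is given)
        return urlparse(url).netloc

    def record_timeout(self, url):
        # call this whenever a request to `url` timed out
        self._timeouts[self.get_domain(url)] += 1

    def is_blocked(self, url):
        # True once the domain has reached the timeout threshold
        return self._timeouts[self.get_domain(url)] >= self.max_timeouts


tracker = DomainTimeoutTracker(max_timeouts=5)
for i in range(1, 6):
    tracker.record_timeout(f"http://dom.com/{i}")

print(tracker.is_blocked("http://dom.com/6"))      # True
print(tracker.is_blocked("http://notsame.com/1"))  # False
```

In the middleware, process_exception would then call record_timeout for each timeout, and process_request would check is_blocked before letting a request through.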

Source

2017-05-22 12:45:38 rrschmidt

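Thanks. Just one thing. I handle timed-out URLs in def err(self), so it would be simpler if I could just force a Timeout error instead of raising IgnoreRequest(). Is that possible? Can't I raise twisted.internet.error.TimeoutError directly? I want to handle all timed-out domains as timeouts, because that is what they are. –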

I have edited my code using your suggestions. What do you think of it? –

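According to the documentation we have to raise 'IgnoreRequest' in 'process_request'... You could try using 'TimeoutError', but I don't know what will happen then. Also, in my example the middleware lets you avoid the err callback in the spider. – rrschmidt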
