Scrapy: handling the 302 response code

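I am using a simple CrawlSpider implementation to crawl websites. By default, Scrapy follows a 302 redirect to the target location and ignores the originally requested link. On one particular site I ran into a page that redirects to another page. My goal is to log both the original link (which responds with 302) and the target location (specified in the HTTP response header), and to process both in the parse_item method of the CrawlSpider. How can I achieve this?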

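I came across solutions that mention using dont_redirect=True or REDIRECT_ENABLED=False, but I do not actually want to ignore the redirects; I want to take the redirecting page into account as well.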

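For example: I visit http://www.example.com/page1, which sends a 302 redirect HTTP response pointing to http://www.example.com/page2. By default, Scrapy ignores page1, follows the redirect to page2 and processes that. I want to process both page1 and page2 in parse_item.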

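EDIT: I am already using handle_httpstatus_list = [500, 404] in the spider class definition to handle 500 and 404 response codes in parse_item, but the same does not work for 302 when I add it to handle_httpstatus_list.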

Source

2016-02-11 bawejakunal

Can you provide a URL that gives you a 302 HTTP status? – Rahul

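Scrapy 1.0.5 (the latest official release as I write these lines) does not use handle_httpstatus_list in the built-in RedirectMiddleware; see this issue. As of Scrapy 1.1.0 (1.1.0rc1 is available), the issue is fixed.

Even with redirects disabled, you can still mimic that behavior in your callback by checking the Location header and returning a Request to the redirect target.

Example spider: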

$ cat redirecttest.py
import scrapy


class RedirectTest(scrapy.Spider):

    name = "redirecttest"
    start_urls = [
        'http://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip'
    ]
    # Let 302 responses reach the callback instead of being filtered out
    handle_httpstatus_list = [302]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.debug("(parse_page) response: status=%d, URL=%s" % (response.status, response.url))
        if response.status in (302,) and 'Location' in response.headers:
            self.logger.debug("(parse_page) Location header: %r" % response.headers['Location'])
            # Header values are bytes on Python 3; decode before joining
            location = response.headers['Location'].decode('latin-1')
            # Follow the redirect manually, resolving Location against the current URL
            yield scrapy.Request(
                response.urljoin(location),
                callback=self.parse_page)

Console log:

$ scrapy runspider redirecttest.py -s REDIRECT_ENABLED=0 
[scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot) 
[scrapy] INFO: Optional features available: ssl, http11 
[scrapy] INFO: Overridden settings: {'REDIRECT_ENABLED': '0'} 
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
[scrapy] INFO: Enabled item pipelines: 
[scrapy] INFO: Spider opened 
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None) 
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/get 
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None) 
[redirecttest] DEBUG: (parse_page) response: status=302, URL=https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip 
[redirecttest] DEBUG: (parse_page) Location header: 'http://httpbin.org/ip' 
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip) 
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/ip 
[scrapy] INFO: Closing spider (finished)

Note that you need handle_httpstatus_list with 302 in it; otherwise you will see this kind of log (from HttpErrorMiddleware):

[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None) 
[scrapy] DEBUG: Ignoring response <302 https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip>: HTTP status code is not handled or not allowed
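
The Location handling in parse_page above can be isolated into a small helper, which may make it easier to see what happens when the header holds a relative path rather than an absolute URL. This is only a sketch using the standard library: resolve_location is a hypothetical name, and decoding the header as latin-1 is an assumption, not something Scrapy mandates.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location_header):
    """Return the absolute redirect target for a 3xx response.

    location_header may be bytes (raw header values often are) or str.
    """
    if isinstance(location_header, bytes):
        # Assumption: decode as latin-1, which never fails on arbitrary bytes.
        location_header = location_header.decode("latin-1")
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL that was originally requested.
    return urljoin(request_url, location_header)


if __name__ == "__main__":
    print(resolve_location("http://www.example.com/page1", b"/page2"))
    print(resolve_location("https://httpbin.org/redirect-to", "http://httpbin.org/ip"))
```

In a spider callback, the first argument would be response.url and the second response.headers['Location'].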

Source

2016-02-11 09:39:07

Yes, this is exactly what I am doing now :) – bawejakunal

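The redirect middleware "catches" the response before it reaches the httperror middleware and issues a new request with the redirect URL. Meanwhile, the original response is not returned, i.e. you never even "see" the 302 codes because they do not reach httperror. So having 302 in handle_httpstatus_list has no effect.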

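Have a look at the source of scrapy.downloadermiddlewares.redirect.RedirectMiddleware: in process_response() you can see what happens. It issues a new request and replaces the original URL with redirected_url. There is no "return response", so the original response gets dropped.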

Basically you just need to override the process_response() function, adding a line with "return response" in addition to sending another request with redirected_url.

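In parse_item you probably want some conditional statements depending on whether the response is a redirect or not? I imagine the two will not look exactly the same, so your items may look quite different too. Another option is to use a different parser for each response (depending on whether the original or the redirected URL is a "special page"); all you need then is a separate parse function in your spider, such as parse_redirected_urls(), which is set as the callback on the redirect request.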
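
One way to write the conditional in parse_item, if redirects stay enabled: as far as I know, Scrapy's RedirectMiddleware records the URLs a request passed through under the redirect_urls key of the request meta, so the callback can check response.meta for it. The sketch below uses a plain dict standing in for response.meta; redirect_record and its field names are hypothetical. Note that this recovers the original link only, not the body of the 302 response itself.

```python
def redirect_record(final_url, meta):
    """Summarise a response's redirect history.

    meta is any mapping shaped like Scrapy's response.meta; only the
    "redirect_urls" key is read.
    """
    chain = list(meta.get("redirect_urls", []))
    return {
        # The first entry of the chain is the link that was originally requested.
        "original_url": chain[0] if chain else final_url,
        "final_url": final_url,
        "redirected": bool(chain),
        "hops": chain,
    }


if __name__ == "__main__":
    meta = {"redirect_urls": ["http://www.example.com/page1"]}
    print(redirect_record("http://www.example.com/page2", meta))
```

Inside parse_item this would be called as redirect_record(response.url, response.meta), branching on the "redirected" field.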

Source

2016-02-11 09:37:28 Ruehri
