How to set up Scrapy to handle captchas

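I'm trying to scrape a site that requires the user to enter a search value and a captcha. I have an optical character recognition (OCR) routine for the captcha that succeeds about 33% of the time. Since the captchas are always alphabetic text, I want to reload the captcha if the OCR function returns non-alphabetic characters. Once I have a text "word", I want to submit the search form.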

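The results come back in the same page, with the form ready for a new search and a new captcha. So I need to rinse and repeat until I've exhausted my search terms.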

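Here's the top-level algorithm:

1. Load the page initially
2. Download the captcha image, run it through OCR
3. If the OCR doesn't come back with a text-only result, refresh the captcha and repeat this step
4. Submit the query form in the page with the search term and the captcha
5. Check the response to see whether the captcha was correct
6. If it was correct, scrape the data
7. Go to 2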
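The "refresh until the OCR result is purely alphabetic" part of the steps above could be sketched roughly as follows. This is only an illustration in plain Python, outside of Scrapy: `fetch_image` and `run_ocr` are hypothetical callables standing in for the captcha download and the OCR routine, `max_attempts` is an assumed safety limit, and the check assumes the captcha uses ASCII letters only.

```python
from typing import Callable, Optional


def is_plausible_captcha(text: str) -> bool:
    """Return True when the OCR output is non-empty and made only of letters.

    Assumption: the captcha uses ASCII letters, so accented or non-Latin
    letters are rejected as well.
    """
    return text.isascii() and text.isalpha()


def read_captcha(
    fetch_image: Callable[[], bytes],
    run_ocr: Callable[[bytes], str],
    max_attempts: int = 10,
) -> Optional[str]:
    """Fetch a fresh captcha and OCR it until the result looks like a word.

    Returns the recognised word, or None if every attempt produced junk.
    """
    for _ in range(max_attempts):
        candidate = run_ocr(fetch_image()).strip()
        if is_plausible_captcha(candidate):
            return candidate
    return None
```

A word that passes this check can still be wrong, so the submitted form's response has to be checked as well (step 5).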

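I've tried using a pipeline for fetching the captcha, but then I don't have the value available for the form submission. If I just fetch the image without going through the framework, using urllib or something, then the session cookie is not submitted, so the captcha validation on the server fails.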

What's the ideal Scrapy way of doing this?


2016-08-25 Sushil

This is a really deep topic with a bunch of solutions. But if you want to apply the logic you've defined in your post, you can use Scrapy Downloader Middlewares.

Something like:

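import logging

from scrapy.exceptions import IgnoreRequest


class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        # only handle requests that are marked with the meta key
        if not request.meta.get('solve_captcha', False):
            return response
        # find_captcha() and solve_captcha() are your own helpers (not shown here)
        captcha = find_captcha(response)
        if not captcha:  # the page might not have a captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            # response.meta is a shortcut to response.request.meta,
            # so store the values on the request
            request.meta['captcha'] = captcha
            request.meta['solved_captcha'] = solved
            return response
        # OCR failed: retry the page to get a new captcha,
        # but give up after max_retries to prevent an endless loop
        retries = request.meta.get('captcha_retries', 0)
        if retries >= self.max_retries:
            logging.warning('max retries for captcha reached for {}'.format(request.url))
            raise IgnoreRequest
        # dont_filter is a Request argument, not a meta key;
        # it keeps the duplicate filter from dropping the retry
        retry_request = request.replace(dont_filter=True)
        retry_request.meta['captcha_retries'] = retries + 1
        return retry_request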

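This example intercepts every response that is marked for it and tries to solve the captcha. If it fails, it retries the page to get a new captcha; if it succeeds, it adds some meta keys to the response with the solved captcha value.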
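For the middleware to run at all, it has to be enabled in the project's `settings.py` through the `DOWNLOADER_MIDDLEWARES` setting. A minimal sketch, assuming the class lives in a module called `myproject.middlewares` (a hypothetical path) and that an order value of 543 suits your project:

```python
# settings.py
# The dotted path is hypothetical; point it at wherever CaptchaMiddleware is defined.
# The number sets the middleware's position relative to Scrapy's built-in
# downloader middlewares; 543 is just an assumed value.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CaptchaMiddleware": 543,
}
```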
In your spider you would use it like this:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name

    def parse(self, response):
        url = ''  # url that requires captcha
        yield scrapy.Request(url, callback=self.parse_captchad,
                             meta={'solve_captcha': True},
                             errback=self.parse_fail)

    def parse_captchad(self, response):
        # value stored by CaptchaMiddleware
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, failure):
        # failed to retrieve captcha in 5 tries :(
        # do stuff
        pass


2016-08-25 08:34:42 Granitosaurus
