This is a very deep topic with a whole bunch of possible solutions. However, if you want to apply the logic you've defined in your post, you can use Scrapy Downloader Middlewares.
Something like this:
import logging

from scrapy.exceptions import IgnoreRequest


class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if not request.meta.get('solve_captcha', False):
            return response  # only solve requests that are marked with the meta key
        captcha = find_captcha(response)
        if not captcha:  # the page might not have a captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            request.meta['captcha'] = captcha
            request.meta['solved_captcha'] = solved
            return response
        # retry the page for a new captcha, but prevent an endless loop
        if request.meta.get('captcha_retries', 0) >= self.max_retries:
            logging.warning('max retries for captcha reached for {}'.format(request.url))
            raise IgnoreRequest
        request.meta['captcha_retries'] = request.meta.get('captcha_retries', 0) + 1
        return request.replace(dont_filter=True)
This example will intercept every response and try to solve the captcha. If solving fails, it will retry the page for a new captcha; if it succeeds, it will add some meta keys to the response with the solved captcha values.
In your spider you would use it like this:
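For Scrapy to actually run the middleware, it also has to be enabled in your project's settings.py. A minimal sketch, assuming your project is called `myproject` and the middleware lives in `myproject/middlewares.py` (adjust the dotted path to your layout; 543 is just a priority in the usual custom-middleware range):

```python
# settings.py (hypothetical module path; change it to match your project)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CaptchaMiddleware': 543,
}
```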
class MySpider(scrapy.Spider):

    def parse(self, response):
        url = ''  # url that requires captcha
        yield scrapy.Request(url, callback=self.parse_captcha,
                             meta={'solve_captcha': True},
                             errback=self.parse_fail)

    def parse_captcha(self, response):
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, failure):
        # failed to solve the captcha in 5 tries :(
        # do stuff
        pass
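The `find_captcha` and `solve_captcha` helpers are left for you to implement, since they depend entirely on the target site and on how you solve captchas. A minimal, purely illustrative sketch, assuming the page marks its captcha image with `class="captcha"` (that selector, and the idea of solving by URL, are assumptions, not part of any site or library):

```python
import re


def find_captcha(response):
    # hypothetical: look for an <img class="captcha" src="..."> in the page;
    # adapt the pattern (or use response.css()) to the real markup you scrape
    match = re.search(r'<img[^>]+class="captcha"[^>]+src="([^"]+)"', response.text)
    return match.group(1) if match else None


def solve_captcha(captcha_url):
    # placeholder: plug in an OCR library or a captcha-solving service here;
    # returning a falsy value makes the middleware above retry the page
    return None
```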