2016-12-03 164 views

Scrapy spider login problem

I'm a beginner with Scrapy, and I've run into this problem when logging in. I simply put all of the form data into a FormRequest.

My code:

from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider

class login_spider(CrawlSpider):
    name = 'login_spider'

    FORM = {
        "_xsrf": "776a978b48e9e828a939c096ae9b787e",
        "password": "...",
        "captcha_type": "cn",
        "email": "...",
    }

    COOKIES = {
        "q_c1": "201afdf74fab4f538d15fd8726c1fe14|1480730632000|1480730632000",
        "_xsrf": "776a978b48e9e828a939c096ae9b787e",
        "l_cap_id": "MDE2MzhmNGUwN2FjNDA1ZTk3NDc5ZDZkZmJhMzM3Y2M=|1480730632|83da14e1526864adfa6e0bec5a9f49bf46f8c460",
        "cap_id": "OGY2MWMzODIxY2VmNGQ4MGExOTk4N2UwNzU1OWFlYzM=|1480730632|77b6eaaca21f9c96ecfa5d5c9832e34dc2e401e0",
        "d_c0": "ADDCXsSu8AqPTuqHLcmhlUeOsUY-UBuyRL0=|1480730633",
        "r_cap_id": "Mjg0YTg2NTcxMjAxNDU2YTljZGNhMjQ1MzVlMjE4ZmI=|1480730633|cd2007eb5d1c6939ac1954b79b83f0d7b5d9e937",
        "_zap": "57aed33d-98b6-4e98-bad4-71581265abde",
        "__utmt": 1,
        "__utma": "51854390.1175567315.1480730634.1480730634.1480730634.1",
        "__utmb": "51854390.4.10.1480730634",
        "__utmc": "51854390",
        "__utmz": "51854390.1480730634.1.1.utmcsr=bing|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)",
        "__utmv": "51854390.000--|3=entry_date=20161203=1",
        "n_c": 1,
    }

    HEADERS = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.8",
        "Connection": "keep-alive",
        "Content-Length": "100",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Host": "www.zhihu.com",
        "Origin": "https://www.zhihu.com",
        "Referer": "https://www.zhihu.com/",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "X-Xsrftoken": "776a978b48e9e828a939c096ae9b787e",
    }

    def start_requests(self):
        return [Request(url="https://www.zhihu.com/#signin", callback=self.login)]

    def login(self, response):
        return [FormRequest(
            "https://www.zhihu.com/#signin",
            formdata=self.FORM,
            cookies=self.COOKIES,
            headers=self.HEADERS,
            callback=self.after_login,
            dont_filter=True,
        )]

    def after_login(self, response):
        print("================\n")
        print("=== LOG IN ===\n")
        print("================\n")

I took the form data from here. (The email & password are randomly generated.)

And this is what I get:

2016-12-03 11:07:07 [scrapy] INFO: Spider opened 
2016-12-03 11:07:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-12-03 11:07:07 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024 
2016-12-03 11:07:07 [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/robots.txt> (referer: None) 
2016-12-03 11:07:07 [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/#signin> (referer: None) 
2016-12-03 11:07:07 [scrapy] DEBUG: Crawled (400) <POST https://www.zhihu.com/#signin> (referer: https://www.zhihu.com/) ['partial'] 
2016-12-03 11:07:08 [scrapy] DEBUG: Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed 
2016-12-03 11:07:08 [scrapy] INFO: Closing spider (finished) 

Then I tried adding this to settings.py:

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36" 
RETRY_ENABLED = True 
RETRY_HTTP_CODES = [400,403,500] 
RETRY_TIMES = 2 
DOWNLOADER_MIDDLEWARES = { 
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 
} 

But I still get the same error, and I don't know what to do. Which part am I doing wrong, and how should I fix it?


It looks like your problem is indicated by 'Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed'. Take a look at [this question](http://stackoverflow.com/questions/32779766/auth-failing-999-http-status-code-is-not-handled-or-not-allowed). You should also inspect the request you are sending, since the 400 response suggests it is probably malformed. – danielunderwood

Answer


A 400 status code is sometimes returned when an invalid CSRF token is supplied. The CSRF token changes every time the page is visited, and it looks like you have hard-coded a static one. Your script needs to make an initial request to the page containing the login form, save the CSRF token in a variable, and then log in with it.
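The token-extraction step described above can be sketched as follows. This is a minimal standalone sketch, not the spider itself: the markup below is a hypothetical stand-in for the real login page, and the field name `_xsrf` is taken from the form data shown in the question.

```python
from html.parser import HTMLParser

class XsrfExtractor(HTMLParser):
    """Collect the value of the hidden <input> named "_xsrf"."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "_xsrf":
            self.token = attrs.get("value")

def extract_xsrf(html):
    """Return the current _xsrf token from the page, or None if absent."""
    parser = XsrfExtractor()
    parser.feed(html)
    return parser.token

# Hypothetical stand-in for response.text of the login page.
page = '<form><input type="hidden" name="_xsrf" value="fresh-token-123"></form>'
token = extract_xsrf(page)
print(token)  # fresh-token-123
```

In the spider, `login()` would call `extract_xsrf(response.text)` and put the result into `formdata` before submitting, instead of using the hard-coded value. Note that Scrapy's built-in `FormRequest.from_response(response, formdata={...})` does this automatically: it reads the form on the fetched page, including hidden fields like `_xsrf`, and posts to the form's real action URL rather than the `#signin` fragment.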