2016-10-22 · 82 views

Scrapy FormRequest not requesting the redirect link

I followed the basic Scrapy login flow. It has always worked, but in this case I ran into a problem: FormRequest.from_response never requests https://www.crowdfunder.com/user/validateLogin; instead it always sends the payload to https://www.crowdfunder.com/user/signup. I also tried POSTing the payload directly to validateLogin, but that returned a 404 error. Any ideas on how to fix this? Thanks in advance!

import scrapy
from scrapy.spiders.init import InitSpider


class CrowdfunderSpider(InitSpider):
    name = "crowdfunder"
    allowed_domains = ["crowdfunder.com"]
    start_urls = [
        'http://www.crowdfunder.com/',
    ]

    login_page = 'https://www.crowdfunder.com/user/login/'
    payload = {}

    def init_request(self):
        """This function is called before crawling starts."""
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        self.payload = {'email': 'my_email',
                        'password': 'my_password'}

        # scrapy login
        return scrapy.FormRequest.from_response(
            response,
            formdata=self.payload,
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.url == 'https://www.crowdfunder.com/user/settings':
            self.log("Successfully logged in. :) :) :)")
            # start the crawling
            return self.initialized()
        else:
            # login failed
            self.log("login failed :(:(:(")

Here are the payload and request URL sent by the browser when the login button is clicked:

payload and request url sent by clicking login button

Here is the log output:

2016-10-21 21:56:21 [scrapy] INFO: Scrapy 1.1.0 started (bot: crowdfunder_crawl) 
2016-10-21 21:56:21 [scrapy] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'NEWSPIDER_MODULE': 'crowdfunder_crawl.spiders', 'SPIDER_MODULES': ['crowdfunder_crawl.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'crowdfunder_crawl'} 
2016-10-21 21:56:21 [scrapy] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2016-10-21 21:56:21 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-10-21 21:56:21 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 

2016-10-21 21:56:21 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-10-21 21:56:21 [scrapy] INFO: Spider opened 

2016-10-21 21:56:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 

2016-10-21 21:56:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024 

2016-10-21 21:56:21 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/robots.txt> (referer: None) 

2016-10-21 21:56:21 [scrapy] DEBUG: Redirecting (301) to <GET http://www.crowdfunder.com/user/login> from <GET https://www.crowdfunder.com/user/login/> 

2016-10-21 21:56:22 [scrapy] DEBUG: Redirecting (301) to <GET https://www.crowdfunder.com/user/login> from <GET http://www.crowdfunder.com/user/login> 

2016-10-21 21:56:22 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/user/login> (referer: None) 

2016-10-21 21:56:23 [scrapy] DEBUG: Crawled (200) <POST https://www.crowdfunder.com/user/signup> (referer: https://www.crowdfunder.com/user/login) 

2016-10-21 21:56:23 [crowdfunder] DEBUG: login failed :(:(:(
2016-10-21 21:56:23 [scrapy] INFO: Closing spider (finished) 
2016-10-21 21:56:23 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1569, 
'downloader/request_count': 5, 
'downloader/request_method_count/GET': 4, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 16313, 
'downloader/response_count': 5, 
'downloader/response_status_count/200': 3, 
'downloader/response_status_count/301': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 22, 4, 56, 23, 232493), 
'log_count/DEBUG': 7, 
'log_count/INFO': 7, 
'request_depth_max': 1, 
'response_received_count': 3, 
'scheduler/dequeued': 4, 
'scheduler/dequeued/memory': 4, 
'scheduler/enqueued': 4, 
'scheduler/enqueued/memory': 4, 
'start_time': datetime.datetime(2016, 10, 22, 4, 56, 21, 180030)} 
2016-10-21 21:56:23 [scrapy] INFO: Spider closed (finished) 

Answer

FormRequest.from_response(response) uses the first form it finds by default. If you inspect which forms make up the page, you will see:

In : response.xpath("//form") 
Out: 
[<Selector xpath='//form' data='<form action="/user/signup" method="post'>, 
<Selector xpath='//form' data='<form action="/user/login" method="POST"'>, 
<Selector xpath='//form' data='<form action="/user/login" method="post"'>] 

So the form you are looking for is not the first one. The fix is to tell from_response which form to use via one of its many arguments.

Using formxpath is the most flexible option and my personal favorite:

In : FormRequest.from_response(response, formxpath='//*[contains(@action,"login")]') 
Out: <POST https://www.crowdfunder.com/user/login> 
Awesome! Thanks for your help! I checked the /user/login page but didn't find any form tags; all the forms seem to be on the homepage. –

@BowenLiu can you clarify? The 'user/login' page seems to redirect to itself twice, and then it contains the 3 forms I listed in my answer. The second form contains all the input fields and is the one to use with FormRequest. – Granitosaurus

Yes, it uses the second form. With 'https://www.crowdfunder.com/user/login' as self.login_page, from_response did not find any forms via response.xpath("//form"), but using the homepage 'https://www.crowdfunder.com' I found all three of your form items and logged in successfully. –