
I am trying to log in to the following site: http://go.galegroup.com.proxy-um.researchport.umd.edu/ps/eToc.do?docId=0PQC&userGroupName=umd_um&action=DO_BROWSE_ETOC&inPS=true&prodId=GVRL&etocId=GALE%7CCX2830999001&isDownLoadOptionDisabled=true. The site first sends back a redirect page that appears to redirect via JavaScript. I could not get Scrapy to follow that redirect on its own, so I am now using scrapy-splash to get to the login page. I can reach the login page, but when I try scrapy.FormRequest.from_response, the response I receive is not the correct page.

My Scrapy spider:

# -*- coding: utf-8 -*- 
import scrapy 
from scrapy_splash import SplashRequest 

class DbioSpider(scrapy.Spider):
    name = 'dbio'
    start_urls = ['http://go.galegroup.com.proxy-um.researchport.umd.edu/ps/eToc.do?docId=0PQC&userGroupName=umd_um&action=DO_BROWSE_ETOC&inPS=true&prodId=GVRL&etocId=GALE%7CCX2830999001&isDownLoadOptionDisabled=true']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html',
                                args={'wait': 0.5})

    def parse(self, response):
        # response.body is bytes, so open the file in binary mode
        with open('login.html', 'wb') as f:
            f.write(response.body)
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': '???', 'password': '???'},
            callback=self.after_login)

    def after_login(self, response):
        with open('after_login.html', 'wb') as f:
            f.write(response.body)

Output:

2017-10-14 09:13:32 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: bio) 
2017-10-14 09:13:32 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bio.spiders', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['bio.spiders'], 'BOT_NAME': 'bio', 'EDITOR': 'emacs', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'} 
2017-10-14 09:13:32 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.memusage.MemoryUsage', 
'scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-10-14 09:13:32 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy_splash.SplashCookiesMiddleware', 
'scrapy_splash.SplashMiddleware', 
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-10-14 09:13:32 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy_splash.SplashDeduplicateArgsMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-10-14 09:13:32 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-10-14 09:13:32 [scrapy.core.engine] INFO: Spider opened 
2017-10-14 09:13:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-10-14 09:13:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-10-14 09:13:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://go.galegroup.com.proxy-um.researchport.umd.edu/ps/eToc.do?docId=0PQC&userGroupName=umd_um&action=DO_BROWSE_ETOC&inPS=true&prodId=GVRL&etocId=GALE%7CCX2830999001&isDownLoadOptionDisabled=true via http://192.168.1.7:8050/render.html> (referer: None) 
2017-10-14 09:13:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.searchum.umd.edu/search?site=UMCP&client=UMCP&proxystylesheet=UMCP&output=xml_no_dtd&as_oq=site%3A&q=Search+UMD.edu&search+button=Search&username=???&password=???> (referer: http://go.galegroup.com.proxy-um.researchport.umd.edu/ps/eToc.do?docId=0PQC&userGroupName=umd_um&action=DO_BROWSE_ETOC&inPS=true&prodId=GVRL&etocId=GALE%7CCX2830999001&isDownLoadOptionDisabled=true) 
2017-10-14 09:13:33 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-10-14 09:13:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1268, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 1, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 18230, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 10, 14, 14, 13, 33, 929932), 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'memusage/max': 50655232, 
'memusage/startup': 50655232, 
'request_depth_max': 1, 
'response_received_count': 2, 
'scheduler/dequeued': 3, 
'scheduler/dequeued/memory': 3, 
'scheduler/enqueued': 3, 
'scheduler/enqueued/memory': 3, 
'splash/render.html/request_count': 1, 
'splash/render.html/response_count/200': 1, 
'start_time': datetime.datetime(2017, 10, 14, 14, 13, 32, 187173)} 
2017-10-14 09:13:33 [scrapy.core.engine] INFO: Spider closed (finished) 

I can provide the returned page if that would help.

Answer

Try making the initial request to this link: https://login.proxy-um.researchport.umd.edu/login?url=http://go.galegroup.com/ps/eToc.do?docId=0PQC&userGroupName=umd_um&action=DO_BROWSE_ETOC&inPS=true&prodId=GVRL&etocId=GALE|CX2830999001&isDownLoadOptionDisabled=true and use scrapy.FormRequest.from_response() without any formdata. After that you will be redirected to https://shib.idm.umd.edu/shibboleth-idp/profile/SAML2/POST/SSO?execution=e2s1

Something like this:

def start_requests(self):
    yield scrapy.Request(self.start_urls[0])

def parse(self, response):
    with open('prepare.html', 'wb') as f:
        f.write(response.body)
    # Submit the intermediate form as-is, without credentials
    return scrapy.FormRequest.from_response(response, callback=self.prepare_login)

def prepare_login(self, response):
    with open('login.html', 'wb') as f:
        f.write(response.body)
    return scrapy.FormRequest.from_response(
        response,
        formdata={'username': '???', 'password': '???'},
        callback=self.after_login)

def after_login(self, response):
    with open('after_login.html', 'wb') as f:
        f.write(response.body)

Unfortunately this didn't work. I've started using Selenium to solve my problem, but thanks anyway. – josh