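Scrapy - Authenticating through Google services
I'm a complete Python newbie, but I need to scrape specific pages in Google Analytics. Google has a two-page split sign-in process, and I don't know how to make it work with Scrapy's FormRequest.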
As a test, I tried logging in to Gmail with the following code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request

class LoginSpider(BaseSpider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr#identifier']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'Email': '[email protected]', 'Passwd': 'password example'},
                    callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            # We've successfully authenticated, let's have some fun!
            return Request(url="https://mail.google.com/mail/u/0/#inbox",
                           callback=self.parse_tastypage)

    def parse_tastypage(self, response):
        sel = Selector(response)
        item = Item()
        item["Test"] = sel.xpath("//h1/text()").extract()
        yield item
But it didn't work. Here is my log file:
2016-03-27 10:30:19 [scrapy] INFO: Spider opened
2016-03-27 10:30:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-27 10:30:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-27 10:30:19 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr#identifier> (referer: None)
2016-03-27 10:30:24 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr)
2016-03-27 10:30:25 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1> from <GET https://mail.google.com/mail/u/0/#inbox>
2016-03-27 10:30:30 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1> (referer: https://accounts.google.com/AccountLoginInfo)
2016-03-27 10:30:30 [scrapy] ERROR: Spider error processing <GET https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or())
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or() if _filter(r))
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or() if _filter(r))
  File "/Users/machine/super/super/spiders/mySuper.py", line 26, in parse_tastypage
    sel = Selector(response)
NameError: global name 'Selector' is not defined
2016-03-27 10:30:30 [scrapy] INFO: Closing spider (finished)
2016-03-27 10:30:30 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1874,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 197446,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 27, 14, 30, 30, 741077),
'log_count/DEBUG': 5,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/NameError': 1,
'start_time': datetime.datetime(2016, 3, 27, 14, 30, 19, 82107)}
2016-03-27 10:30:30 [scrapy] INFO: Spider closed (finished)
Any idea how I should proceed?
Thanks
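One way to think about the two-page sign-in described above: each page carries its own form, usually with hidden fields, and each form has to be submitted in turn. The sketch below is a stdlib-only illustration (Python 3 assumed) of collecting a page's input fields and layering your own values on top, which is roughly the idea behind pre-populating a form request from a response. The two HTML snippets, the `token` field, the `/ServiceLoginAuth` action, and the `user@example.com` address are made-up stand-ins, not Google's real markup; only the `Email` and `Passwd` field names and `/AccountLoginInfo` come from the question. The helper names `FormFieldCollector` and `build_form_data` are hypothetical.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from every <input> element on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # A missing or valueless "value" attribute becomes an empty string.
            self.fields[name] = attr.get("value") or ""


def build_form_data(html, overrides):
    """Return the page's form fields with our own values layered on top."""
    collector = FormFieldCollector()
    collector.feed(html)
    data = dict(collector.fields)
    data.update(overrides)
    return data


# Hypothetical stand-ins for the two sign-in pages (not Google's real markup).
IDENTIFIER_PAGE = """
<form action="/AccountLoginInfo" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="email" name="Email" value="">
</form>
"""

PASSWORD_PAGE = """
<form action="/ServiceLoginAuth" method="post">
  <input type="hidden" name="token" value="def456">
  <input type="hidden" name="Email" value="user@example.com">
  <input type="password" name="Passwd" value="">
</form>
"""

# Step 1: submit only the email on the first page.
step_one = build_form_data(IDENTIFIER_PAGE, {"Email": "user@example.com"})
# Step 2: submit the password on the page returned by step 1.
step_two = build_form_data(PASSWORD_PAGE, {"Passwd": "password example"})

print(step_one)
print(step_two)
```

In Scrapy terms, this would probably mean calling `FormRequest.from_response` with only `Email` in `parse`, then calling it again with `Passwd` in that request's callback, rather than sending both fields to the first page. Treat that as a starting point to verify against the actual pages, not a confirmed fix.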
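Thanks Padraic, your code helped a lot! I can now log in to Google Analytics, but I'm stuck again: I can't seem to send the HTTP requests needed to get the data I need. I've posted another question here: http://stackoverflow.com/questions/36273347/scraping-google-analytics-with-scrapy
No worries. I'm about to call it a night, but I'll take a look if you haven't gotten an answer before then.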