2016-03-28

Scrapy - authenticating through Google services

I am a complete Python newbie, but I need to scrape specific pages in Google Analytics. Google has a two-page split sign-in process, and I don't know how to make it work with Scrapy's FormRequest.

I tried logging in to Gmail as a test with the following code:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import FormRequest, Request 

class LoginSpider(BaseSpider): 
    name = 'super' 
    start_urls = ['https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr#identifier'] 

    def parse(self, response): 
        return [FormRequest.from_response(response, 
            formdata={'Email': '[email protected]', 'Passwd': 'password example'}, 
            callback=self.after_login)] 

    def after_login(self, response): 
        if "authentication failed" in response.body: 
            self.log("Login failed", level=log.ERROR) 
            return 
        else: 
            # We've successfully authenticated, let's have some fun! 
            return Request(url="https://mail.google.com/mail/u/0/#inbox", 
                callback=self.parse_tastypage) 

    def parse_tastypage(self, response): 
        sel = Selector(response) 
        item = Item() 
        item["Test"] = sel.xpath("//h1/text()").extract() 
        yield item 

but it did not work. Here is my log file:

2016-03-27 10:30:19 [scrapy] INFO: Spider opened 
2016-03-27 10:30:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-03-27 10:30:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-03-27 10:30:19 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr#identifier> (referer: None) 
2016-03-27 10:30:24 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr) 
2016-03-27 10:30:25 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1> from <GET https://mail.google.com/mail/u/0/#inbox> 
2016-03-27 10:30:30 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1> (referer: https://accounts.google.com/AccountLoginInfo) 
2016-03-27 10:30:30 [scrapy] ERROR: Spider error processing <GET https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1> (referer: https://accounts.google.com/AccountLoginInfo) 
Traceback (most recent call last): 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/defer.py", line 102, in iter_errback 
    yield next(it) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output 
    for x in result: 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr> 
    return (_set_referer(r) for r in result or()) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "/Users/machine/super/super/spiders/mySuper.py", line 26, in parse_tastypage 
    sel = Selector(response) 
NameError: global name 'Selector' is not defined 
2016-03-27 10:30:30 [scrapy] INFO: Closing spider (finished) 
2016-03-27 10:30:30 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1874, 
'downloader/request_count': 4, 
'downloader/request_method_count/GET': 3, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 197446, 
'downloader/response_count': 4, 
'downloader/response_status_count/200': 3, 
'downloader/response_status_count/302': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 3, 27, 14, 30, 30, 741077), 
'log_count/DEBUG': 5, 
'log_count/ERROR': 1, 
'log_count/INFO': 7, 
'log_count/WARNING': 1, 
'request_depth_max': 2, 
'response_received_count': 3, 
'scheduler/dequeued': 4, 
'scheduler/dequeued/memory': 4, 
'scheduler/enqueued': 4, 
'scheduler/enqueued/memory': 4, 
'spider_exceptions/NameError': 1, 
'start_time': datetime.datetime(2016, 3, 27, 14, 30, 19, 82107)} 
2016-03-27 10:30:30 [scrapy] INFO: Spider closed (finished) 

Any idea how I should proceed?

Thanks

Answer


You are missing the imports that are causing the errors in your code; you have imported neither Selector nor Item:

from scrapy.selector import Selector 
from scrapy import Item 

If you look at the output, you can clearly see:

NameError: global name 'Selector' is not defined 

You would see the exact same error for Item, since it is not imported either.
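The failure mode is easy to reproduce in isolation: referencing a name that was never imported raises a NameError at the point of use, not at startup, which is why the spider crawled several pages before blowing up (the exact wording of the message differs slightly between Python 2 and 3):

```python
def use_selector():
    # Selector was never imported, so this only fails when called.
    return Selector

try:
    use_selector()
except NameError as exc:
    print(exc)  # e.g. "name 'Selector' is not defined"
```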

You also have level=log.ERROR, where log is not defined or imported anywhere; you probably want logging.ERROR, so you need to import logging:

import logging 


level=logging.ERROR 
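For reference, the level constants used here come from the plain standard-library logging module (this is stdlib behaviour, not anything Scrapy-specific); a quick sketch:

```python
import logging

# ERROR is just a numeric severity constant in the stdlib logging module.
print(logging.ERROR)                            # 40
print(logging.getLevelName(logging.ERROR))      # ERROR
print(logging.DEBUG < logging.ERROR)            # True
```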

This is all very basic stuff; I would recommend you work through the tutorial before attempting anything close to code of this complexity. If you don't have a good grasp of the basics, life will be more difficult than it needs to be.

Here is a working implementation:

from scrapy.spider import BaseSpider 
from scrapy.selector import Selector 
from scrapy.http import FormRequest, Request 
import logging 
from scrapy import Field 
from scrapy import Item 


class Product(Item): 
    Test = Field() 


class LoginSpider(BaseSpider): 
    name = 'super' 
    start_urls = ['https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr#identifier'] 

    def parse(self, response): 
        return [FormRequest.from_response(response, 
            formdata={'Email': '[email protected]', 'Passwd': 'pass'}, 
            callback=self.after_login)] 

    def after_login(self, response): 
        if "authentication failed" in response.body: 
            self.log("Login failed", level=logging.ERROR) 
            return 
        # We've successfully authenticated, let's have some fun! 
        print("Login Successful!!") 
        return Request(url="https://mail.google.com/mail/u/0/#inbox", 
            callback=self.parse_tastypage) 

    def parse_tastypage(self, response): 
        item = Product() 
        item["Test"] = response.xpath("//h1/text()").extract() 
        yield item 
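As a side note on the extraction step: response.xpath("//h1/text()").extract() collects the text of every h1 element on the page. A rough standard-library analogue (using hypothetical markup; unlike Scrapy's selectors, ElementTree only handles well-formed XML, so this is for illustration only) would be:

```python
from xml.etree import ElementTree

# Hypothetical, well-formed page standing in for the real response body.
html = "<html><body><h1>Inbox</h1><h1>Sent</h1></body></html>"

root = ElementTree.fromstring(html)
headings = [h1.text for h1 in root.iter("h1")]  # roughly //h1/text()
print(headings)  # ['Inbox', 'Sent']
```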

A sample run using my own login:

DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/&hl=fr) 
Login Successful!! 
2016-03-28 02:13:27 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1> from <GET https://mail.google.com/mail/u/0/#inbox> 

I have not verified that it fully works; I just reused your own logic. In any case, it is a working example of how to use scrapy.


Thanks Padraic, your code helped a lot! I can now log in to Google Analytics, but I am stuck again: I can't seem to send the http request needed to fetch the data I want. I have posted another question here: http://stackoverflow.com/questions/36273347/scraping-google-analytics-with-scrapy –


No worries. I'm about to call it a night, but I will take a look if you don't get an answer before then. –