Scrapy with Selenium for a web page that requires authentication

I want to scrape data from a page that makes a lot of AJAX calls and executes JavaScript to render the content, so I am trying to use Selenium together with Scrapy. The plan is as follows:

  1. Add the login page URL to Scrapy's start_urls list.
  2. In the response callback, POST the username and password with a FormRequest to authenticate.
  3. Once logged in, request the page that needs to be scraped.
  4. Pass this response to the Selenium webdriver so it can click a button on the page.
  5. Once the button is clicked and the new page renders, capture the result.

The code I have so far is as follows:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from selenium import webdriver
import time


class LoginSpider(BaseSpider):
    name = "sel_spid"
    start_urls = ["http://www.example.com/login.aspx"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # step 2: fill in and submit the login form
        return FormRequest.from_response(
            response,
            formdata={'User': 'username', 'Pass': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Log Out" in response.body:
            self.log("Successfully logged in")
            # step 3: request the page that needs to be scraped
            scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
            yield Request(url=scrape_url, callback=self.parse_page)
        else:
            self.log("Bad credentials")

    def parse_page(self, response):
        # step 4: hand the page over to Selenium and click the button
        self.driver.get(response.url)
        next = self.driver.find_element_by_class_name('dxWeb_pNext')
        next.click()
        time.sleep(2)
        # step 5: capture the html and store in a file

The two roadblocks I have hit so far are:

  1. Step 4 does not work. Whenever Selenium opens the Firefox window, it is sitting at the login screen, and I do not know how to get past it.
  2. I do not know how to implement step 5.

Any help would be appreciated.

+1

In theory, you could pass the scrapy response cookies to the driver via the 'add_cookie' method, see: http://stackoverflow.com/questions/16563073/how-to-pass-scrapy-login-cookies-to-selenium and http://stackoverflow.com/questions/19082248/python-selenium-rc-create-cookie. But why not just log in with 'selenium', as Eric says? Thanks. – alecxe 2015-02-10 01:02:16
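
A minimal sketch of what this comment describes, assuming the session cookie is named 'ASP.NET_SessionId' (a guess for an .aspx site, not something from the question) and that the driver visits the domain before add_cookie is called, which Selenium requires:

def transfer_session(self, response):
    # Selenium only accepts a cookie for the domain it is currently on,
    # so load any page from the site first.
    self.driver.get(response.url)
    for header in response.headers.getlist('Set-Cookie'):
        # A Set-Cookie header looks like 'name=value; Path=/; HttpOnly'.
        name, _, value = header.split(';')[0].partition('=')
        if name.strip() == 'ASP.NET_SessionId':  # assumed cookie name
            self.driver.add_cookie({'name': name.strip(), 'value': value.strip()})
    self.driver.get(response.url)  # reload, now with the scrapy session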

+0

I could do that, but I don't want to lose the awesome Twisted code running under scrapy's hood. I plan to scrape a large number of URLs once I am authenticated, and I want to do that in a non-blocking way. Is my thinking wrong? – Amistad 2015-02-10 04:12:19

Answers

2

I don't believe you can switch between scrapy requests and selenium like that. You need to log in to the site with selenium, not with yield Request(). The login session you create with scrapy will not transfer to the selenium session. Here is an example (the element IDs/XPaths will be different for you):

scrape_url = "http://www.example.com/authen_handler.aspx"
self.driver.get(scrape_url)
time.sleep(2)
username = self.driver.find_element_by_id("User")
password = self.driver.find_element_by_name("Pass")
username.send_keys("your_username")
password.send_keys("your_password")
self.driver.find_element_by_xpath("//input[@name='commit']").click()

Then you can do:

time.sleep(2)
self.driver.find_element_by_class_name('dxWeb_pNext').click()
time.sleep(2)
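
To actually capture the result once the new page has rendered (the asker's step 5), one minimal approach is to write the driver's page_source to a file; the filename 'results.html' here is just an example:

# Save the JavaScript-rendered page; 'results.html' is an arbitrary name.
html = self.driver.page_source
with open('results.html', 'w') as f:
    f.write(html.encode('utf-8'))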

EDIT: If you need to render JavaScript and are worried about speed/non-blocking, you can use http://splash.readthedocs.org/en/latest/index.html, which should do the trick.

http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on passing a cookie; you should be able to pass it from scrapy, but I have not done it before.
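
For what it's worth, a rough sketch of routing a request through a local Splash instance (assumed to be listening on its default port 8050, which is not stated in the answer) via its render.html endpoint, so the JavaScript-rendered page comes back as an ordinary scrapy response:

import urllib

from scrapy.http import Request

def request_rendered_page(self, url):
    # Ask Splash to load the page, wait 2 seconds for scripts to run,
    # and return the rendered HTML. Assumes Splash at localhost:8050.
    splash_url = "http://localhost:8050/render.html?" + urllib.urlencode(
        {'url': url, 'wait': 2})
    return Request(splash_url, callback=self.parse_page)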

+0

Thank you soooooo much! Works like a charm! – rzaaeeff 2015-06-28 20:10:09

0

Log in with the scrapy API first:

# call scrapy post request with browse_files as callback
return FormRequest.from_response(
    response,
    # formxpath=formxpath,
    formdata=formdata,
    callback=self.browse_files
)

Then pass the session to the selenium chrome driver:

from selenium.webdriver.common.by import By

# logged in previously with scrapy api
def browse_files(self, response):
    print "browse files for: %s" % (response.url)

    # the session cookies set by the login live in the response headers
    cookie_list2 = response.headers.getlist('Set-Cookie')
    print cookie_list2

    # load the page once so the driver is on the right domain,
    # then clear whatever cookies it set on its own
    self.driver.get(response.url)
    self.driver.delete_all_cookies()

    # extract all the cookies
    for cookie2 in cookie_list2:
        # a Set-Cookie header looks like 'name=value; Path=/; ...',
        # so split on ';' and inspect each piece
        cookies = map(lambda e: e.strip(), cookie2.split(";"))

        for cookie in cookies:
            splitted = cookie.split("=")
            if len(splitted) == 2:
                name = splitted[0]
                value = splitted[1]
                # for my particular usecase I needed only these values
                if name == 'csrftoken' or name == 'sessionid':
                    cookie_map = {"name": name, "value": value}
                else:
                    continue
            elif len(splitted) == 1:
                cookie_map = {"name": splitted[0], "value": ''}
            else:
                continue

            print "adding cookie"
            print cookie_map
            self.driver.add_cookie(cookie_map)

    # reload the page, this time with the scrapy session cookies attached
    self.driver.get(response.url)

    # check if we have successfully logged in
    files = self.wait_for_elements_to_be_present(By.XPATH, "//*[@id='files']", response)
    print files
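
The wait_for_elements_to_be_present helper is not shown in this answer; a plausible implementation (an assumption on my part, built on Selenium's standard explicit-wait API) might look like:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_elements_to_be_present(self, by, expression, response, timeout=10):
    # Poll the driver until the elements appear (or the timeout expires).
    # The unused 'response' argument mirrors the call site above; the
    # 10-second timeout is an arbitrary choice.
    return WebDriverWait(self.driver, timeout).until(
        EC.presence_of_all_elements_located((by, expression)))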