2012-12-17 20 views

Using Scrapy and Selenium to scrape a page that requires login and loads data with JavaScript

I want to scrape a page that requires login and loads its data with JavaScript. At the moment I can log in successfully with Scrapy, but my spider cannot see the data I need, because that data is loaded by JavaScript.

I did some searching and found that Selenium might be a solution. I want to use Selenium to drive a browser and view the page. It seems I should use the Selenium WebDriver tool, but I don't know how. Does anyone know where and how I should add the Selenium code to my spider?

Many thanks.

My spider looks like this:

    from scrapy.spider import BaseSpider 
    from scrapy.selector import HtmlXPathSelector 
    from scrapy.http import Request, FormRequest 

    from selenium import selenium  # old Selenium RC client 
    import time 

    from login.items import SummaryItem 

    class TitleSpider(BaseSpider): 
        name = "titleSpider" 
        allowed_domains = ["domain.com"] 
        start_urls = ["https://www.domain.com/login"] 

        def __init__(self): 
            BaseSpider.__init__(self) 
            self.verificationErrors = [] 
            # How can I know Selenium passes authentication? 
            self.selenium = selenium("localhost", 4444, "*firefox", 
                                     "https://www.domain.com/result1") 
            print "Starting the Selenium server!" 
            self.selenium.start() 
            print "Successfully started the Selenium server!" 

        def __del__(self): 
            self.selenium.stop() 
            print self.verificationErrors 

        # Authentication 
        def parse(self, response): 
            return [FormRequest.from_response(response, 
                formdata={'session_key': 'myusername', 
                          'session_password': 'mypassword'}, 
                callback=self.after_login)] 

        # Request the result page 
        def after_login(self, response): 
            # Check that the login succeeded before going on 
            if "Error" in response.body: 
                print "Login failed" 
            else: 
                print "Login succeeded" 
                # This page loads some of its data using JavaScript 
                return Request(url="https://www.domain.com/result1", 
                               callback=self.parse_page) 

        # Parse the page 
        def parse_page(self, response): 
            item = SummaryItem() 
            hxs = HtmlXPathSelector(response) 
            # My spider cannot see the name: it is rendered by JavaScript 
            item['name'] = hxs.select('//span[@class="name"]/text()').extract() 

            # Should I add the Selenium code here? 
            # Can it load a page that requires authentication? 
            sel = self.selenium 
            sel.open(response.url) 
            time.sleep(4) 
            # Selenium RC has no select(); get_text() takes an XPath locator 
            item['name'] = sel.get_text('//span[@class="name"]') 

            return item 

I found a workaround: I used Selenium WebDriver to log in to the site and parse the page. It works well! – Olivia

Answer


You can try something like this:

    # At the top of the spider module: 
    from selenium import webdriver 
    from time import sleep 

    def __init__(self): 
        BaseSpider.__init__(self) 
        self.verificationErrors = [] 
        self.selenium = webdriver.Firefox() 

    def __del__(self): 
        self.selenium.quit() 
        print self.verificationErrors 

    def parse(self, response): 
        # Drive the real browser to the login page 
        sel = self.selenium 
        sel.get(response.url) 
        sleep(3) 
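The fixed `sleep(3)` above simply hopes the JavaScript has finished by then. A common alternative is to poll for a condition with a timeout instead of sleeping blindly. Here is a minimal stdlib-only sketch of that idea; `wait_until` and `fake_page_loaded` are hypothetical names, not part of the Selenium API (Selenium WebDriver ships its own `WebDriverWait` for this):

```python
import time

def wait_until(check_fn, timeout=10.0, poll=0.5):
    """Poll check_fn until it returns a truthy value or the timeout expires.

    Returns the truthy value, or raises RuntimeError on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = check_fn()
        if result:
            return result
        time.sleep(poll)
    raise RuntimeError("condition not met within %.1fs" % timeout)

# Example: a stand-in condition that becomes true on the third poll,
# mimicking a page element that appears after the JavaScript runs.
state = {"calls": 0}

def fake_page_loaded():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until(fake_page_loaded, timeout=5.0, poll=0.01))  # True
```

In a spider, `check_fn` would be a small function that asks the browser whether the element you need is present, so the crawl continues as soon as the data is there rather than after a fixed delay.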