Scraping a page that requires login and loads its data with JavaScript, using Scrapy and Selenium

I want to scrape a page that requires login and loads its data with JavaScript. I can currently log in successfully using Scrapy, but my spider cannot see the data I need because that data is loaded by JavaScript.

I did some searching and found that Selenium might be a possible solution. I want to use Selenium to open a browser and view the page. It seems I should use the Selenium WebDriver tool, but I don't know how. Does anyone know where and how I should add the Selenium code to my spider?

Many thanks.
My spider looks like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request, FormRequest
from selenium import selenium
import time

from login.items import SummaryItem


class titleSpider(BaseSpider):
    name = "titleSpider"
    allowed_domains = ["domain.com"]
    start_urls = ["https://www.domain.com/login"]

    def __init__(self):
        BaseSpider.__init__(self)
        self.verificationErrors = []
        # How can I know whether Selenium passes authentication?
        self.selenium = selenium("localhost", 4444, "*firefox",
                                 "https://www.domain.com/result1")
        print "Starting the Selenium server!"
        self.selenium.start()
        print "Successfully started the Selenium server!"

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors

    # Authentication
    def parse(self, response):
        return [FormRequest.from_response(
            response,
            formdata={'session_key': 'myusername',
                      'session_password': 'mypassword'},
            callback=self.after_login)]

    # Request the page that loads data with JavaScript
    def after_login(self, response):
        # Check that the login succeeded before going on
        if "Error" in response.body:
            print "Login failed"
        else:
            print "Login succeeded"
            return Request(url="https://www.domain.com/result1",
                           callback=self.parse_page)

    # Parse the page
    def parse_page(self, response):
        item = SummaryItem()
        hxs = HtmlXPathSelector(response)
        # My spider cannot see the name here:
        item['name'] = hxs.select('//span[@class="name"]/text()').extract()
        # Should I add the Selenium code here? Can it load a page
        # that requires authentication?
        sel = self.selenium
        sel.open(response.url)
        time.sleep(4)
        item['name'] = sel.select('//span[@class="name"]/text()').extract()
        return item
I found a workaround: I used Selenium WebDriver to log in to the website and parse the page. It works well! – Olivia
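That workaround can be sketched roughly as follows: drive a real browser through the login form with Selenium WebDriver, wait for the JavaScript to run, then parse the rendered HTML. login_and_render is an assumption about the site (it reuses the session_key / session_password field names from the question) and is not executed here; extract_name uses the stdlib HTML parser so the sketch stays self-contained, but in a spider you would feed driver.page_source to a Scrapy selector with the original XPath instead.

```python
def login_and_render(url, username, password):
    """Log in by driving a real browser, then return the fully
    rendered HTML of the result page. Requires selenium and a
    browser driver on PATH; form field names are assumptions."""
    import time
    from selenium import webdriver  # imported lazily; not run in tests
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        driver.find_element("name", "session_key").send_keys(username)
        driver.find_element("name", "session_password").send_keys(password)
        driver.find_element("name", "session_password").submit()
        driver.get("https://www.domain.com/result1")
        time.sleep(4)  # crude wait for the JavaScript-loaded data
        return driver.page_source
    finally:
        driver.quit()


def extract_name(html):
    """Pull the text of <span class="name"> out of already-rendered
    HTML, mirroring the spider's XPath but with the stdlib parser."""
    from html.parser import HTMLParser

    class NameParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_name = False
            self.names = []

        def handle_starttag(self, tag, attrs):
            if tag == "span" and ("class", "name") in attrs:
                self.in_name = True

        def handle_endtag(self, tag):
            if tag == "span":
                self.in_name = False

        def handle_data(self, data):
            if self.in_name:
                self.names.append(data)

    parser = NameParser()
    parser.feed(html)
    return parser.names
```

With this split, the browser does both the authentication and the JavaScript rendering, so the "can Selenium load a page that requires authentication?" problem disappears: it is the same browser session throughout.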