I have a Scrapy CrawlSpider that parses links and returns HTML content just fine. For JavaScript pages, I enlisted Selenium to access the "hidden" content. The problem is that while Selenium works outside the Scrapy parse (in `__init__`), it does not work inside the `parse_item` function — Selenium inside Scrapy does not work.
    from scrapy.spiders import CrawlSpider, Rule, Spider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.linkextractors import LinkExtractor
    from scrapy.linkextractors.sgml import SgmlLinkExtractor
    from craigslist_sample.items import CraigslistReviewItem
    import scrapy
    from selenium import selenium
    from selenium import webdriver

    class MySpider(CrawlSpider):
        name = "spidername"
        allowed_domains = ["XXXXX"]
        start_urls = ['XXXXX']

        rules = (
            Rule(LinkExtractor(allow=('reviews\?page',)), callback='parse_item'),
            Rule(LinkExtractor(allow=('.',), deny=('reviews\?page',)), follow=True))

        def __init__(self):
            # this page loads
            CrawlSpider.__init__(self)
            self.selenium = webdriver.Firefox()
            self.selenium.get('XXXXX')
            self.selenium.implicitly_wait(30)

        def parse_item(self, response):
            # this page doesn't
            print response.url
            self.driver.get(response.url)
            self.driver.implicitly_wait(30)
            # ...do things
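One likely cause, sketched below as an assumption rather than a confirmed diagnosis: `__init__` stores the WebDriver as `self.selenium`, but `parse_item` reads `self.driver`, an attribute that was never set, so every callback would raise an `AttributeError` before Selenium is ever used. A minimal stand-in class (no real browser is launched; the string is a hypothetical placeholder for `webdriver.Firefox()`) reproduces the mismatch:

```python
# Minimal sketch of the suspected bug: the driver is saved under one
# attribute name but read under another inside the parse callback.
class SpiderSketch(object):
    def __init__(self):
        # stands in for: self.selenium = webdriver.Firefox()
        self.selenium = "firefox-driver-placeholder"

    def parse_item_broken(self, url):
        # reads an attribute that __init__ never set
        return self.driver

    def parse_item_fixed(self, url):
        # reads the attribute that __init__ actually set
        return self.selenium


spider = SpiderSketch()
try:
    spider.parse_item_broken("http://example.com")
    print("broken: no error")
except AttributeError as exc:
    print("broken: AttributeError")
print("fixed:", spider.parse_item_fixed("http://example.com"))
```

If this is the issue, renaming one side so both `__init__` and `parse_item` use the same attribute (e.g. `self.driver` everywhere) should let the callback reach the browser.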
You haven't explained what exactly isn't working, nor what you have tried. We kind of need to know what '#...do things' actually does.. – Mobrockers
Note: it looks like you are using `selenium` –