2016-06-16

I have a Scrapy CrawlSpider that follows links and parses the returned HTML content just fine. For JavaScript pages, I enlist Selenium to reach the "hidden" content. The problem is that while Selenium works outside of the Scrapy parse, it does not work inside the parse_item function.

from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistReviewItem
import scrapy
from selenium import selenium
from selenium import webdriver


class MySpider(CrawlSpider):
    name = "spidername"
    allowed_domains = ["XXXXX"]
    start_urls = ['XXXXX']

    rules = (
        Rule(LinkExtractor(allow=(r'reviews\?page',)), callback='parse_item'),
        Rule(LinkExtractor(allow=('.',), deny=(r'reviews\?page',)), follow=True))

    def __init__(self):
        # this page loads
        CrawlSpider.__init__(self)
        self.selenium = webdriver.Firefox()
        self.selenium.get('XXXXX')
        self.selenium.implicitly_wait(30)

    def parse_item(self, response):
        # this page doesn't
        print response.url
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)

        # ...do things

You are not explaining what exactly does not work, nor what you have tried. We kind of need to know what '#...do things' actually does. – Mobrockers

Note: avoid using a module name as a variable name, like your use of 'selenium' –
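(As an aside on that naming point, here is a minimal illustration of why rebinding a module's name is risky; the variable names below are hypothetical, not from the question:)

from selenium import webdriver
import selenium

driver = webdriver.Firefox()  # fine: the module name stays intact
selenium = driver             # risky: 'selenium' no longer refers to the module
selenium.webdriver.Firefox()  # AttributeError: the name now points at the driver object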

Answers

You have a problem with your variables. In the __init__ method you assign the browser instance to self.selenium, but then in the parse_item method you use self.driver as the browser instance. I have updated your script; try it now.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_sample.items import CraigslistReviewItem
from selenium import webdriver


class MySpider(CrawlSpider):
    name = "spidername"
    allowed_domains = ["XXXXX"]
    start_urls = ['XXXXX']

    rules = (
        Rule(LinkExtractor(allow=(r'reviews\?page',)), callback='parse_item'),
        Rule(LinkExtractor(allow=('.',), deny=(r'reviews\?page',)), follow=True))

    def __init__(self):
        CrawlSpider.__init__(self)
        # one consistent name for the browser instance: self.driver
        self.driver = webdriver.Firefox()
        self.driver.get('XXXXX')
        self.driver.implicitly_wait(30)

    def parse_item(self, response):
        print response.url
        # self.driver now matches the attribute assigned in __init__
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)

        # ...do things
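A natural follow-up is what "#...do things" can look like, and how to shut the browser down when the crawl ends. Below is a minimal sketch, not from the original answer: it assumes you want to hand the Selenium-rendered HTML back to Scrapy's selectors, and the XPath expression and the 'review' item field are hypothetical placeholders.

from scrapy.selector import Selector

class MySpider(CrawlSpider):
    # ... name, allowed_domains, start_urls, rules, __init__ as above ...

    def parse_item(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)

        # parse the JavaScript-rendered DOM with Scrapy's own selectors
        sel = Selector(text=self.driver.page_source)
        item = CraigslistReviewItem()
        item['review'] = sel.xpath('//div[@class="review"]//text()').extract()  # hypothetical XPath and field
        yield item

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; quit the browser here
        self.driver.quit()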

Awesome! A combination of Hasan's answer and a better understanding of the URLs I was scraping led to the answer (it turns out the site had planted "fake" URLs that never loaded).
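For what it's worth, one way to avoid hanging on URLs that never finish loading is an explicit wait with a timeout instead of implicitly_wait: it gives up quickly on pages that never render. A minimal sketch, assuming a known element (here a hypothetical 'review' class) marks a fully loaded page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class MySpider(CrawlSpider):
    # ... as above ...

    def parse_item(self, response):
        self.driver.get(response.url)
        try:
            # wait at most 10 seconds for the content to appear
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'review')))
        except TimeoutException:
            # the page never rendered (e.g. a planted "fake" URL); skip it
            return
        # ...do things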

So accept Hasan's answer to give him credit for the work he did, and please don't use a module name as a script variable name. And don't post this as your own answer. – JeffC