
I have come to know that you need a browser-automation toolkit like Selenium to automate the scraping. How can I use Selenium together with Scrapy to automate the process?

How can I click the next button on the Google Play store so I can scrape the reviews for my college project?

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from selenium import webdriver
import time


class Product(scrapy.Item):
    title = scrapy.Field()


class FooSpider(CrawlSpider):
    name = 'foo'

    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Chrome(executable_path="C:\\chrm\\chromedriver.exe")
        self.browser.implicitly_wait(60)

    def parse(self, response):
        self.browser.get(response.url)
        sites = response.xpath('//div[@class="single-review"]/div[@class="review-header"]')
        items = []
        for i in range(0, 200):
            time.sleep(20)
            button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
            button.click()
            self.browser.implicitly_wait(30)
            for site in sites:
                item = Product()
                item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
                yield item

I have updated my code, but it just gives me the same 40 items over and over again. What is wrong with my for loop?

It also seems that the updated page source is not being passed to the XPath, which is why it keeps returning the same 40 items.
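
For reference, the repetition comes from selecting sites once from the original response outside the loop. A minimal sketch of re-selecting from the live browser source after each click (reusing the same review XPaths as above) could look like this:

# inside parse(), after button.click():
source = self.browser.page_source  # HTML after the click, not the original response
sel = Selector(text=source)  # build a fresh Selector from the updated source
for site in sel.xpath('//div[@class="single-review"]/div[@class="review-header"]'):
    item = Product()
    item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
    yield item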

Answer


Going back to the page source, I would do something like this:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from selenium import webdriver
import time


class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.com']
    start_urls = ['http://foo.com']

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Firefox()
        self.browser.implicitly_wait(60)

    def parse_foo(self, response):
        self.browser.get(response.url)  # load the response url in the browser
        button = self.browser.find_element_by_xpath("path")  # find the element to click
        button.click()  # click it
        time.sleep(1)  # wait until the page is fully loaded
        source = self.browser.page_source  # get the source of the loaded page
        sel = Selector(text=source)  # create a Selector object from it
        data = sel.xpath('path/to/the/data')  # select the data
        ...

However, it is better not to wait for a fixed amount of time. So, instead of time.sleep(1), you can use one of the approaches described here: http://www.obeythetestinggoat.com/how-to-get-selenium-to-wait-for-page-load-after-a-click.html
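
For example, an explicit wait along these lines avoids the hard-coded sleep (just a sketch; the XPath is a placeholder that would need to match whatever element the click actually loads):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the expected element to appear before reading page_source
WebDriverWait(self.browser, 10).until(
    EC.presence_of_element_located((By.XPATH, 'path/to/expected/element'))
)
source = self.browser.page_source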


It still does not load the url in the browser –


The browser opens, but the url is never entered –


Try 'webdriver.Chrome()' instead of 'webdriver.Firefox()'. Firefox did not work in my case either. – Timofey
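
Swapping the driver is a one-line change in __init__ (a sketch, assuming chromedriver is on the PATH or at the path used in the question):

self.browser = webdriver.Chrome()  # or webdriver.Chrome(executable_path="C:\\chrm\\chromedriver.exe")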
