Scraping dynamic content from multiple start URLs with Selenium and Scrapy

My task is to build a scraper for a property site and store the results for later processing. The site in question is a national one that will not return everything in a single search; it expects a region to be chosen before it serves any results. To work around this, I built a Scrapy spider with multiple start URLs, each taking me directly to a region I am interested in. The site is also populated dynamically, so I use Selenium to render the JavaScript on the page and then follow the "next" button until the scraper has finished each region.

This works as long as there is a single start URL, but it breaks down as soon as there are several. The scraper starts off fine, but before the webdriver has finished following the "next" button to the end of one region (a single region can have 20 pages, for example), the spider moves on to the next region (start URL) while it is still scraping the first region's content.

I have searched extensively for a solution but have not found anyone with this problem. Any suggestions would be most welcome. A code sample follows:

    from scrapy.spiders import CrawlSpider
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from selenium_spider.items import DemoSpiderItem

    class DemoSpider(CrawlSpider):
        name = "Demo"
        allowed_domains = ['example.com']
        start_urls = ["http://www.example.co.uk/locationIdentifier=REGION 1234",
                      "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION 5678"]

        def __init__(self):
            super(DemoSpider, self).__init__()
            self.driver = webdriver.Firefox()

        def __del__(self):
            # shut down the shared webdriver when the spider goes away
            self.driver.quit()

        def parse(self, response):
            self.driver.get(response.url)

            # NOTE: this parses Scrapy's static response, not the page
            # source that Selenium has rendered
            result = response.xpath('//*[@class="l-searchResults"]')
            source = 'aTest'
            while True:
                try:
                    element = WebDriverWait(self.driver, 10).until(
                        EC.element_to_be_clickable((By.CSS_SELECTOR,
                            ".pagination-button.pagination-direction.pagination-direction--next"))
                    )
                    print "Scraping new site --------------->", result
                    print "This is the result----------->", result
                    for properties in result:
                        saleOrRent = properties.xpath('//*[@class="property-title"]/text()').extract()
                        addresses = properties.xpath('//*[@class="property-address"]/text()').extract()
                        if saleOrRent:
                            saleOrRent = saleOrRent[0]
                            if 'for sale' in saleOrRent:
                                saleOrRent = 'For Sale'
                            elif 'to rent' in saleOrRent:
                                saleOrRent = 'To Rent'
                        for a in addresses:
                            item = DemoSpiderItem()
                            item["saleOrRent"] = saleOrRent
                            item["source"] = source
                            item["address"] = a
                            item["response"] = response
                            yield item
                    element.click()
                except TimeoutException:
                    # no clickable "next" button left: this region is finished
                    break
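
For context on why this happens: Scrapy schedules requests for every URL in start_urls concurrently, so parse can be invoked for the second region while the single shared webdriver is still clicking through the first. Forcing the crawl to be sequential narrows the problem down; here is a minimal sketch, assuming a Scrapy 1.x project where the custom_settings class attribute is available (CONCURRENT_REQUESTS is a standard Scrapy setting):

    class DemoSpider(CrawlSpider):
        name = "Demo"
        # force one in-flight request at a time so the shared webdriver
        # never has to paginate two regions simultaneously
        custom_settings = {
            'CONCURRENT_REQUESTS': 1,
        }

This only serializes the downloads; the answer below removes the race entirely by feeding the spider one URL at a time.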

I have the same problem! Did you ever find a solution? I'm currently looking as well; if I come across anything I'll let you know.

Answer


I actually just played around with this a bit, and it turned out to be easier than I thought. You pass only one initial URL via start_urls, keep your follow-up URLs in a separate manual list, yield a manual Request with the parse function as the callback, and use a counter to index into manual_urls so the right URL gets passed to each request.

This way you decide yourself when to load the next URL, for example once there are no more results. The only downside is that it is sequential, but oh well... :-)

See the code:

    import scrapy
    from scrapy.http.request import Request
    from selenium import webdriver
    from scrapy.selector import Selector
    from products_scraper.items import ProductItem

    class ProductsSpider(scrapy.Spider):
        name = "products_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/first']

        # follow-up URLs that are requested manually, one at a time
        global manual_urls
        manual_urls = [
            'http://www.example.com/second',
            'http://www.example.com/third'
        ]

        global manual_url_index
        manual_url_index = 0

        def __init__(self):
            super(ProductsSpider, self).__init__()
            self.driver = webdriver.Firefox()

        def parse(self, response):
            # declare the global before it is read; using the name first and
            # declaring it global afterwards is a syntax error in Python 3
            # and a warning in Python 2
            global manual_url_index

            self.driver.get(response.url)

            hasPostings = True

            while hasPostings:
                try:
                    # finding the button inside the try block means running
                    # out of "next" links ends the loop instead of raising
                    next_button = self.driver.find_element_by_xpath('//dd[@class="next-page"]/a')
                    next_button.click()
                    self.driver.set_script_timeout(30)
                    products = self.driver.find_elements_by_css_selector('.products-list article')

                    if len(products) == 0:
                        # this URL is exhausted: queue the next manual URL, if any
                        if manual_url_index < len(manual_urls):
                            yield Request(manual_urls[manual_url_index],
                                          callback=self.parse)
                            manual_url_index += 1

                        hasPostings = False

                    for product in products:
                        item = ProductItem()
                        # store product info here
                        yield item

                except Exception, e:
                    print str(e)
                    break

        # Scrapy calls closed() automatically when the spider finishes
        def closed(self, reason):
            self.driver.quit()
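
A small design note: the module-level globals work, but the same sequencing can be kept on the spider itself with an instance attribute. A minimal sketch of that variant, using the same hypothetical URLs as above and assuming Scrapy 1.x with the old Selenium find_element_* API:

    import scrapy
    from scrapy.http.request import Request
    from selenium import webdriver

    class SequentialProductsSpider(scrapy.Spider):
        # hypothetical name; behaviour mirrors the answer above, minus globals
        name = "products_spider_sequential"
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/first']
        manual_urls = ['http://www.example.com/second',
                       'http://www.example.com/third']

        def __init__(self):
            super(SequentialProductsSpider, self).__init__()
            self.driver = webdriver.Firefox()
            self.manual_url_index = 0   # per-spider counter, no global needed

        def parse(self, response):
            self.driver.get(response.url)
            # ... paginate with self.driver exactly as in the answer above ...
            # when the current URL is exhausted, queue the next one:
            if self.manual_url_index < len(self.manual_urls):
                url = self.manual_urls[self.manual_url_index]
                self.manual_url_index += 1
                yield Request(url, callback=self.parse)

        def closed(self, reason):
            self.driver.quit()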