2015-11-15

Scrapy callback Request not working

I want to scrape this site. I am using Scrapy's Request, but it does not work and the code behaves unexpectedly. Here is my code:

    # -*- coding: utf-8 -*-

    from scrapy.spiders import BaseSpider
    from scrapy.selector import Selector
    from scrapy.http import Request, Response
    import re
    import csv
    import time

    from selenium import webdriver


    class ColdWellSpider(BaseSpider):
        name = "cwspider"
        allowed_domains = ["coldwellbankerhomes.com"]
        #start_urls = [''.join(row).strip() for row in csv.reader(open("remaining_links.csv"))]
        #start_urls = ['https://www.coldwellbankerhomes.com/fl/boynton-beach/5451-verona-drive-unit-d/pid_9266204/']
        start_urls = ['https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/']

        def parse(self, response):
            #browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false'])
            browser = webdriver.Firefox()
            browser.maximize_window()
            browser.get(response.url)
            time.sleep(5)

            # to extract all the links from a page and send request to those links
            self.getlink(response)

            # for clicking the load more button in the page
            while True:
                try:
                    browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                    time.sleep(3)
                    self.getlink(browser)
                except:
                    break

        def getlink(self, response):
            print 'hhelo'

            c = open('data_getlink.csv', 'a')
            d = csv.writer(c, lineterminator='\n')
            print 'hello2'
            listclass = response.xpath('//div[@class="list-items"]/div[contains(@id,"snapshot")]')

            for l in listclass:
                link = 'http://www.coldwellbankerhomes.com/' + ''.join(l.xpath('./h2/a/@href').extract())

                d.writerow([link])
                yield Request(url=str(link), callback=self.parse_link)

        # callback function of Request
        def parse_link(self, response):
            b = open('data_parselink.csv', 'a')
            a = csv.writer(b, lineterminator='\n')
            a.writerow([response.url])

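The problem is with yield Request(url = str(link),callback=self.parse_link). When I remove this line, the getlink function is called perfectly and the links are written to the data_getlink.csv file. But when that line is present, the body of getlink does not run at all, and so the callback function is never called either. Any help would be much appreciated.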

Answer


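The problem is the yield statement.

When a yield statement is present, getlink becomes a generator function: calling it only creates a generator object, and its body is not executed until the first item is requested from that generator. Because parse calls self.getlink(...) and discards the result, nothing ever iterates over it, so no links are written and no requests reach Scrapy.

To fix this, call the getlink function like this: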
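The lazy behaviour described above can be reproduced without Scrapy. The following is a minimal sketch in Python 3; `getlink_demo` and the `calls` list are hypothetical stand-ins introduced only for illustration and are not part of the original spider.

```python
calls = []  # records when the function body actually starts running


def getlink_demo(links):
    # Hypothetical stand-in for getlink: the yield below makes this
    # a generator function, so calling it does not run the body.
    calls.append("body started")
    for link in links:
        yield link


gen = getlink_demo(["a", "b"])  # only creates a generator object
started_before = list(calls)    # snapshot taken before any iteration

items = list(gen)               # iterating is what executes the body
started_after = list(calls)     # snapshot taken after iteration
```

Here `started_before` is still empty because nothing has asked the generator for an item yet, which matches the symptom in the question where not even the print statements in getlink appear.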

for i in self.getlink(browser): # actually, browser or response here? 
    yield i 

Or, in Python 3:

yield from self.getlink(browser)
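To see why re-yielding matters, here is a small Python 3 sketch that contrasts the original call pattern with the delegated one. It assumes plain functions and strings in place of the spider methods and Request objects; `parse_broken`, `parse_fixed`, and the `"request:"` prefix are hypothetical names used only for this illustration.

```python
def getlink(links):
    # Stand-in for the spider's getlink: yields one "request" per link.
    for link in links:
        yield "request:" + link


def parse_broken(links):
    # Mirrors the original parse: the generator is created and then
    # thrown away, so none of its items are ever produced.
    getlink(links)


def parse_fixed(links):
    # Delegates to getlink and passes every item on to the caller
    # (in the real spider, the caller would be Scrapy).
    yield from getlink(links)


broken_result = parse_broken(["a", "b"])
fixed_result = list(parse_fixed(["a", "b"]))
```

With delegation, `parse_fixed` is itself a generator, so whatever consumes it receives each item that getlink yields. The `for ... yield` loop shown earlier has the same effect and also works on Python 2.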