2017-05-15
0

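I'm new to Scrapy. I wrote a spider like the one below, but I can't figure out why parse_item is never invoked as the callback for the requests yielded from parse. In short, the Scrapy callback is not working.

Any help is welcome. Thanks in advance.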

# Imports inferred from usage; the original post did not show them.
# The code targets Python 2 (it relies on the unicode type).
import datetime
import socket
from urlparse import urljoin

from scrapy import Request, Spider
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose

# StackItem is the asker's own Item class; its module is not shown in the post.


class ManualSpider(Spider):
    name = "manual"
    allowed_domains = ["https://www.gumtree.com"]
    start_urls = ['https://www.gumtree.com/flats-houses/london']

    def parse_item(self, response):
        # Create the loader using the response
        l = ItemLoader(item=StackItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//main/div[2]/header/h1/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', '//header/span/strong/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//p[@itemprop="description"]'
                    '[1]/text()', Join(), MapCompose(unicode.strip))
        l.add_xpath('address', '//*[@itemtype="http://schema.org/'
                    'Place"][1]/text()', MapCompose(unicode.strip))
        l.add_xpath('location', '//header/strong/span/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                    MapCompose(lambda i: urljoin(response.url, i)))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', "example")
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        yield l.load_item()

    def parse(self, response):
        # Get the next index URLs and yield Requests
        next_selector = response.xpath('//*[@class="pagination-next"]//@href')
        for url in next_selector.extract():
            yield Request(urljoin(response.url, url))

        # Get item URLs and yield Requests
        item_selector = response.xpath('//div[@id="srp-results"]//article//@href')
        for url in item_selector.extract():
            if url != "":
                print(urljoin(response.url, url))
                yield Request(urljoin(response.url, url), callback=self.parse_item)

Answers

1

It doesn't work because you are passing the callback as a string: callback="parse_item".

You should pass a reference to the method itself instead, like this: callback=self.parse_item
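The difference between the two forms can be seen without Scrapy at all. The following is only a minimal sketch in Python 3: MiniSpider and dispatch are hypothetical stand-ins invented for illustration, not Scrapy classes or functions. It shows that a bound method can be called later by whoever holds it, while a string holding the method's name cannot.

```python
class MiniSpider:
    """Hypothetical stand-in for a spider; this is not a Scrapy class."""

    def parse_item(self, response):
        return "parsed: " + response


def dispatch(callback, response):
    """Invoke a stored callback, roughly as a framework would do later on."""
    if not callable(callback):
        raise TypeError(
            "callback must be callable, got %s" % type(callback).__name__
        )
    return callback(response)


spider = MiniSpider()

# A bound method carries both the function and the instance it belongs to.
result = dispatch(spider.parse_item, "<html>")

# A string is just text; nothing can be called on it.
try:
    dispatch("parse_item", "<html>")
    string_error = None
except TypeError as exc:
    string_error = str(exc)

print(result)
print(string_error)
```

Passing self.parse_item hands over something that is ready to be called, which is why it is the form to use with Request.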

Also remove the "https://" prefix from allowed_domains. It should contain bare domain names such as www.gumtree.com, not full URLs.
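To see why the scheme matters, here is a simplified sketch of a hostname check in the spirit of Scrapy's offsite filtering. It is an approximation written for illustration in Python 3, not Scrapy's actual source; the helper names and the item URL are assumptions. The point is that the check compares each entry against the request's hostname only, so an entry that includes "https://" can never match.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a pattern matching each listed domain or any of its subdomains."""
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


def is_allowed(url, allowed_domains):
    """Return True if the URL's hostname matches one of the allowed domains."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


# Hypothetical item URL, used only for this illustration.
item_url = "https://www.gumtree.com/p/some-listing/123"

with_scheme = is_allowed(item_url, ["https://www.gumtree.com"])
bare_host = is_allowed(item_url, ["www.gumtree.com"])
bare_domain = is_allowed(item_url, ["gumtree.com"])

print(with_scheme)   # False: the hostname never contains "https://"
print(bare_host)     # True
print(bare_domain)   # True: subdomains such as www are accepted
```

Under this model, requests yielded from parse are treated as offsite and dropped before they are downloaded, so parse_item never runs even though the callback is set correctly.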

+0

It makes no difference; neither of them works. – altruistic

+1

Try removing the https:// prefix from allowed_domains. –

+0

That works, thank you. – altruistic

0

Change callback="parse_item" to callback=self.parse_item