2017-04-15 37 views

I am trying to use a CrawlSpider to scrape some real estate data, but it keeps raising a TypeError about the spider's __init__ arguments:

Traceback (most recent call last):
  File "//anaconda/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 96, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
TypeError: __init__() takes exactly 3 arguments (1 given)

Here is the code where the spider is defined:

class RealestateSpider(scrapy.spiders.CrawlSpider):
    ### Real estate web crawler
    name = 'buyrentsold'
    allowed_domains = ['realestate.com.au']

    def __init__(self, command, search):
        search = re.sub(r'\s+', '+', re.sub(',+', '%2c', search)).lower()
        url = '/{0}/in-{{0}}{{{{0}}}}/list-{{{{1}}}}'.format(command)
        start_url = 'http://www.{0}{1}'
        start_url = start_url.format(
            self.allowed_domains[0], url.format(search)
        )
        self.start_urls = [start_url.format('', 1)]
        extractor = scrapy.linkextractors.sgml.SgmlLinkExtractor(
            allow=url.format(re.escape(search)).format('.*', '')
        )
        rule = scrapy.spiders.Rule(
            extractor, callback='parse_items', follow=True
        )
        self.rules = [rule]
        super(RealestateSpider, self).__init__()

    def parse_items(self, response):
        ### Parse a page of real estate listings
        hxs = scrapy.selector.HtmlXPathSelector(response)
        for i in hxs.select('//div[contains(@class, "listingInfo")]'):
            item = RealestateItem()
            path = 'div[contains(@class, "propertyStats")]//text()'
            item['price'] = i.select(path).extract()
            vcard = i.select('div[contains(@class, "vcard")]//a')
            item['address'] = vcard.select('text()').extract()
            url = vcard.select('@href').extract()
            if len(url) == 1:
                item['url'] = 'http://www.{0}{1}'.format(
                    self.allowed_domains[0], url[0]
                )
            features = i.select('dl')
            for field in ('bed', 'bath', 'car'):
                path = '(@class, "rui-icon-{0}")'.format(field)
                path = 'dt[contains{0}]'.format(path)
                path = '{0}/following-sibling::dd[1]'.format(path)
                path = '{0}/text()'.format(path)
                item[field] = features.select(path).extract() or 0
            yield item
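As an aside, the nested braces in the URL template are easy to misread: each `.format()` call consumes one level of brace escaping, so the template is filled in three stages. A quick stand-in trace, using `'buy'` and `'sydney'` as hypothetical inputs:

```python
# Stage-by-stage trace of the spider's URL template (hypothetical inputs).
command, search = 'buy', 'sydney'

# Stage 1: substitute the command; each '{{' collapses to '{'.
url = '/{0}/in-{{0}}{{{{0}}}}/list-{{{{1}}}}'.format(command)
print(url)  # /buy/in-{0}{{0}}/list-{{1}}

# Stage 2: substitute the search term; one more brace level collapses.
url = url.format(search)
print(url)  # /buy/in-sydney{0}/list-{1}

# Stage 3: prepend the domain, then fill the final page placeholders.
start_url = 'http://www.{0}{1}'.format('realestate.com.au', url)
print(start_url.format('', 1))
# http://www.realestate.com.au/buy/in-sydney/list-1
```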

Here is where the error comes up:

crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings) 
sp=RealestateSpider(command, search) 
crawler.crawl(sp) 
crawler.start() 

Can anyone help me with this problem? Thanks!

Answer


The crawler.crawl() method requires a spider class as its argument, whereas your code provides a spider instance.
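The mechanics show up in the last frame of the traceback: from_crawler ends in cls(*args, **kwargs), and since no constructor arguments were forwarded to crawl(), the class gets re-instantiated with none. This can be reproduced without Scrapy at all (DemoSpider here is a hypothetical stand-in, not a Scrapy class):

```python
# Stand-in for a spider whose __init__ requires two arguments.
class DemoSpider(object):
    def __init__(self, command, search):
        self.command = command
        self.search = search

# What Scrapy's from_crawler effectively does is cls(*args, **kwargs);
# with no args forwarded, the required parameters are missing.
try:
    DemoSpider()  # mirrors spider = cls() inside from_crawler
except TypeError as exc:
    print('TypeError:', exc)
```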

There are several ways to do this correctly, but the most straightforward is simply to extend the spider class:

class MySpider(Spider):
    command = None
    search = None

    def __init__(self):
        # do something with self.command and self.search
        super(MySpider, self).__init__()

Then:

crawler = scrapy.crawler.CrawlerProcess(scrapy.conf.settings)

class MySpider(RealestateSpider):
    command = 'foo'
    search = 'bar'

crawler.crawl(MySpider)
crawler.start()
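An alternative worth noting: crawl() also forwards any extra positional and keyword arguments to the spider's constructor through from_crawler, so (assuming a Scrapy release with that behavior) crawler.crawl(RealestateSpider, command, search) would keep the original two-argument __init__ unchanged. A Scrapy-free stand-in of that forwarding (ArgSpider is hypothetical):

```python
# Stand-in mimicking how crawl()'s extra args reach __init__ via from_crawler.
class ArgSpider(object):
    def __init__(self, command, search):
        self.command = command
        self.search = search

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Scrapy instantiates the class itself, forwarding crawl()'s extras.
        return cls(*args, **kwargs)

spider = ArgSpider.from_crawler(None, 'buy', 'sydney')
print(spider.command, spider.search)  # buy sydney
```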