Scrapy CrawlSpider and LinkExtractor rules not working for pagination

I can't understand why the CrawlSpider in scrapy fails to follow pagination even though the rules are set.

However, if I change start_urls to http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/ and comment out parse_start_url, I get more items for that page.

My goal is to scrape all the categories. Any idea what I'm doing wrong?

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from bitcointravel.items import BitcointravelItem


class BitcoinSpider(CrawlSpider):
    name = "bitcoin"
    allowed_domains = ["bitcoin.travel"]
    start_urls = [
        "http://bitcoin.travel/categories/"
    ]

    rules = (
        # Extract pagination links and parse them with parse_items
        Rule(LinkExtractor(allow=(r'.+/page/\d+/$',),
                           restrict_xpaths=('//a[@class="next page-numbers"]',)),
             callback='parse_items', follow=True),
    )

    def parse_start_url(self, response):
        for sel in response.xpath("//ul[@class='maincat-list']/li"):
            url = sel.xpath('a/@href').extract()[0]
            if url == 'http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/':
                # url = 'http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/'
                yield scrapy.Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        for sel in response.xpath("//div[@class='grido']"):
            item = BitcointravelItem()
            item['name'] = sel.xpath('a/@title').extract()
            item['website'] = sel.xpath('a/@href').extract()
            yield item
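
For reference, a minimal items.py consistent with the two fields populated above (an assumption, since the question doesn't show this file):

import scrapy


class BitcointravelItem(scrapy.Item):
    # Fields written in parse_items (assumed definition)
    name = scrapy.Field()
    website = scrapy.Field()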

Here are the resulting stats:

{'downloader/request_bytes': 574, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 98877, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'dupefilter/filtered': 3, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 2, 15, 13, 44, 17, 37859), 
'item_scraped_count': 24, 
'log_count/DEBUG': 28, 
'log_count/INFO': 8, 
'request_depth_max': 1, 
'response_received_count': 2, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2016, 2, 15, 13, 44, 11, 250892)} 
2016-02-15 14:44:17 [scrapy] INFO: Spider closed (finished) 

The item count is supposed to be 55, not 24.

Answer


For http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/, the HTML source contains links matching the rule's pattern '.+/page/\d+/$':

<a class='page-numbers' href='http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/page/2/'>2</a> 
<a class='page-numbers' href='http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/page/3/'>3</a> 

whereas http://bitcoin.travel/categories/ contains no such links; it mainly contains links to other category pages:

... 
<li class="cat-item cat-item-227"><a href="http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-coffee-tea-supplies/" title="The best Coffee &amp; Tea Supplies businesses where you can spend your bitcoins!">Coffee &amp; Tea Supplies</a> </li> 
<li class="cat-item cat-item-50"><a href="http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-cupcakes/" title="The best Cupcakes businesses where you can spend your bitcoins!">Cupcakes</a> </li> 
<li class="cat-item cat-item-229"><a href="http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-distilleries/" title="The best Distilleries businesses where you can spend your bitcoins!">Distilleries</a> </li> 
... 

You need to add a rule (or rules) to crawl those category pages if you want to crawl more.
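
For example, here is a minimal sketch of such a rule set, assuming every category page URL contains '/listing-category/' as in the links shown above:

rules = (
    # Follow links to category pages from http://bitcoin.travel/categories/
    # (assumption: all category URLs contain '/listing-category/')
    Rule(LinkExtractor(allow=(r'.+/listing-category/.+',)),
         callback='parse_items', follow=True),
    # Follow pagination links within a category, as in the original spider
    Rule(LinkExtractor(allow=(r'.+/page/\d+/$',),
                       restrict_xpaths=('//a[@class="next page-numbers"]',)),
         callback='parse_items', follow=True),
)

With both rules in place, parse_start_url is no longer needed to reach a single category by hand: the first rule discovers every category from the start page, and the second follows each category's pagination.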