Multi-page Scrapy crawl finishes my items too early - functions don't chain and wait for completion

I'm building a football app and trying to wrap my head around how multi-page scraping works.

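For example, the first page (http://footballdatabase.com/ranking/world/1) has 2 sets of links I want to scrape: the club name links and the pagination links.

I want to go through a) each page (pagination) and then b) each club, and grab its current EU ranking.

The code I wrote partly works. However, I end up with only about 45 results instead of 2000+ clubs. Note: there are 45 pages of pagination, so it looks as if everything finishes and my items are yielded as soon as the pagination loop is done.

How can I get it all to chain together so that I end up with something more like 2000+ results?

Here is my code: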

# get pagination links
def parse(self, response):
    for href in response.css("ul.pagination > li > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_club)

# get club links on each of the pagination pages
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):
        item = rankingItem()
        item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()

        # get more club information
        club_href = sel.xpath('td[2]/a[1]/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        return request

# get the EU ranking on each of the club pages
def parse_club_page_2(self, response):
    item = response.meta['item']
    item['eu_ranking'] = response.xpath('//a[@class="label label-default"][2]/text()').extract()

    yield item

Source

2016-02-26 willdanceforfun

You need to yield from the parse_club callback, not return:

# get club links on each of the pagination pages
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):
        item = rankingItem()
        item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()

        # get more club information
        club_href = sel.xpath('td[2]/a[1]/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        yield request  # FIX HERE
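
The reason this matters: a `return` inside the `for` loop leaves `parse_club` on the first table row, so each listing page schedules exactly one club request, and 45 listing pages give roughly 45 items. Scrapy does not raise an error because a callback is allowed to return a single request. Here is a minimal plain-Python sketch of the difference; the function names and the `rows` list are made up for illustration and no Scrapy is involved:

```python
def callback_with_return(rows):
    # Mirrors the original parse_club: exits the function on the first row.
    for row in rows:
        return "request for " + row


def callback_with_yield(rows):
    # Mirrors the fixed parse_club: a generator producing one request per row.
    for row in rows:
        yield "request for " + row


rows = ["club-a", "club-b", "club-c"]  # hypothetical table rows

first_only = callback_with_return(rows)
all_rows = list(callback_with_yield(rows))

print(first_only)  # a single value, taken from the first row only
print(all_rows)    # one entry per row
```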

I would also simplify the element-locating part:

def parse_club(self, response):
    # loop through each of the club cells
    for sel in response.css('td.club'):
        item = rankingItem()
        item['name'] = sel.xpath('.//div[@itemprop="itemListElement"]/text()').extract_first()

        # get more club information
        club_href = sel.xpath('.//a/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        yield request
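
To see the three callbacks chain end to end, here is a toy model of the crawl that runs without Scrapy. Everything in it is an assumption made for illustration: the `FAKE_SITE` and `EU_RANKING` dictionaries, the `Request` tuple and the `run` loop are hypothetical stand-ins, the callbacks take `(url, meta)` instead of a real response, and a real Scrapy engine schedules requests asynchronously rather than through this simple queue.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for scrapy.Request: a URL, a callback and a meta dict.
Request = namedtuple("Request", ["url", "callback", "meta"])

# Hypothetical site: 3 listing pages with 2 clubs each.
FAKE_SITE = {
    "/ranking/1": ["club-a", "club-b"],
    "/ranking/2": ["club-c", "club-d"],
    "/ranking/3": ["club-e", "club-f"],
}
# Hypothetical EU rankings shown on each club page.
EU_RANKING = {"club-" + c: i for i, c in enumerate("abcdef", start=1)}


def parse(url, meta):
    # Pagination step: one request per listing page.
    for page_url in FAKE_SITE:
        yield Request(page_url, parse_club, {})


def parse_club(url, meta):
    # One request per club; the half-filled item travels along in meta.
    for name in FAKE_SITE[url]:
        item = {"name": name}
        yield Request("/club/" + name, parse_club_page_2, {"item": item})


def parse_club_page_2(url, meta):
    # Complete the item on the club page and emit it.
    item = meta["item"]
    item["eu_ranking"] = EU_RANKING[item["name"]]
    yield item


def run(start_url):
    # Minimal queue-based loop: follow requests, collect finished items.
    queue = deque([Request(start_url, parse, {})])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url, request.meta):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = run("/ranking/1")
print(len(items))  # one item per club: 3 pages x 2 clubs
```

Swapping the `yield` in `parse_club` for a `return` in this model would leave only one item per listing page, which matches the roughly 45 results described in the question.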

Source

2016-02-26 15:26:43 alecxe

I think it goes without saying that, in my eyes, you are a god. I was about to rewrite the whole thing. – willdanceforfun
