2013-07-14 79 views

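Scrapy only crawls the first page, not the rest

I am using Scrapy on a project in which I need to scrape business details from a business directory: http://directory.thesun.co.uk/find/uk/computer-repair
The problem I am facing is that when I crawl the site, my spider only scrapes the details from the first page, but I also need the details from the remaining 9 pages, that is, all 10 pages. My spider code, items.py, and settings.py are shown below. Please look at my code and help me solve this problem.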

Spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2.items import Project2Item

class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk/"]
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        items = []
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            items.append(item)
        return items

My items.py code is as follows:

from scrapy.item import Item, Field 

class Project2Item(Item): 
    Catogory = Field() 
    Bussiness_name = Field() 
    Description = Field() 
    Number = Field() 
    Web_url = Field() 
    adress_name = Field() 
    Photo_name = Field() 
    Photo_path = Field() 

My settings.py is:

BOT_NAME = 'project2' 

SPIDER_MODULES = ['project2.spiders'] 
NEWSPIDER_MODULE = 'project2.spiders' 

Please help me extract the details from the other pages too.

Answers


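Here is the working code. Pagination should be handled by studying the website and its paging structure and then applying it accordingly. In this case, the site uses "/page/x", where x is the page number.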

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from project2.items import Project2Item

class ProjectSpider(BaseSpider):
    name = "project2spider"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["directory.thesun.co.uk"]
    current_page_no = 1
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def get_next_url(self, fired_url):
        # A URL without '/page/' after the first page is treated as the
        # end of the pagination.
        if '/page/' not in fired_url and self.current_page_no != 1:
            return None
        self.current_page_no += 1
        return "http://directory.thesun.co.uk/find/uk/computer-repair/page/%s" % self.current_page_no

    def parse(self, response):
        fired_url = response.url
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        # Yield one item per listing found on the current page
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            yield item
        # After the items, schedule the next page with the same callback
        next_url = self.get_next_url(fired_url)
        if next_url:
            yield Request(next_url, self.parse, dont_filter=True)
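
As a side note, the next-page logic can also be written without keeping a counter on the spider, by reading the page number from the URL itself. This is only a minimal sketch using the standard library: `next_page_url`, `BASE_URL`, and the `last_page=10` cap are hypothetical names and assumptions (the 10 comes from the page count mentioned in the question), not part of Scrapy, and the cap would need adjusting if the site has more pages.

```python
import re

# Base listing URL taken from the question.
BASE_URL = "http://directory.thesun.co.uk/find/uk/computer-repair"


def next_page_url(current_url, last_page=10):
    """Return the URL of the page after current_url, or None at the end.

    last_page is an assumed cap (the question mentions 10 pages).
    """
    match = re.search(r"/page/(\d+)/?$", current_url)
    # A URL without a "/page/x" suffix is treated as page 1.
    current_page = int(match.group(1)) if match else 1
    if current_page >= last_page:
        return None
    return "%s/page/%d" % (BASE_URL, current_page + 1)
```

In `parse`, this could stand in for `self.get_next_url(fired_url)` by calling `next_page_url(response.url)` and yielding a `Request` only when the result is not `None`.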

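I tried the code that @nizam.sp posted, and it shows only 2 records: 1 record from the main page (the last record) and 1 record from the second page (a random record), and then it ends.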