Scrapy only crawls the first page, not the rest

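I am using Scrapy in a project where I need to scrape business details from the business directory at http://directory.thesun.co.uk/find/uk/computer-repair
The problem I am facing is this: when I crawl the site, my spider only scrapes the details from the first page, whereas I need the details from the remaining 9 pages as well, i.e. all 10 pages. My spider code, items.py and settings.py are shown below. Please look at my code and help me solve this problem.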

Spider code::

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2.items import Project2Item

class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk/"]
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        items = []
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            items.append(item)
        return items

My items.py code is as follows::

from scrapy.item import Item, Field 

class Project2Item(Item): 
    Catogory = Field() 
    Bussiness_name = Field() 
    Description = Field() 
    Number = Field() 
    Web_url = Field() 
    adress_name = Field() 
    Photo_name = Field() 
    Photo_path = Field()

My settings.py is::

BOT_NAME = 'project2' 

SPIDER_MODULES = ['project2.spiders'] 
NEWSPIDER_MODULE = 'project2.spiders'

Please help me extract the details from the other pages too...

2013-07-14 Abhimanyu

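Below is working code. Pagination has to be handled by studying the site and its paging structure and then following it accordingly. In this case the site uses "/page/x", where x is the page number.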

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from project2.items import Project2Item  # project name as in the question's settings.py

class ProjectSpider(BaseSpider):
    name = "project2spider"
    # allowed_domains takes bare domain names, not URLs with a scheme
    allowed_domains = ["directory.thesun.co.uk"]
    current_page_no = 1
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def get_next_url(self, fired_url):
        # Landing on a URL without '/page/' after page 1 is treated as the end of pagination
        if '/page/' not in fired_url and self.current_page_no != 1:
            return None
        self.current_page_no += 1
        return "http://directory.thesun.co.uk/find/uk/computer-repair/page/%s" % self.current_page_no

    def parse(self, response):
        fired_url = response.url
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            yield item
        # After the items of this page, schedule the next page (if any)
        next_url = self.get_next_url(fired_url)
        if next_url:
            yield Request(next_url, self.parse, dont_filter=True)
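The page counter kept on the spider can also be avoided by deriving the next URL from the URL that was just fetched. The following is only a minimal sketch, assuming the "/page/x" pattern described in this answer and the total of 10 pages mentioned in the question; `next_page_url`, `BASE_URL` and `LAST_PAGE` are hypothetical names for illustration, not part of Scrapy.

```python
BASE_URL = "http://directory.thesun.co.uk/find/uk/computer-repair"
LAST_PAGE = 10  # assumption: the question mentions 10 pages in total


def next_page_url(current_url, last_page=LAST_PAGE):
    """Return the URL of the page after current_url, or None when done."""
    if "/page/" in current_url:
        # e.g. ".../computer-repair/page/3" gives page_no == 3
        page_no = int(current_url.rstrip("/").rsplit("/page/", 1)[1])
    else:
        # The bare listing URL is treated as page 1.
        page_no = 1
    if page_no >= last_page:
        return None
    return "%s/page/%d" % (BASE_URL, page_no + 1)
```

Inside `parse`, the result of `next_page_url(response.url)` could be passed to `Request` whenever it is not `None`. Because the function holds no state, it behaves the same no matter in which order responses arrive.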

2013-07-15 17:12:22

If you check the pagination links, they look like this:

http://directory.thesun.co.uk/find/uk/computer-repair/page/2
http://directory.thesun.co.uk/find/uk/computer-repair/page/3

You can loop over the pages with urllib2, using a variable for the page number:

import urllib2

base_url = 'http://directory.thesun.co.uk/find/uk/computer-repair/page/'
for page in range(2, 11):  # pages 2 to 10; page 1 is the bare listing URL
    response = urllib2.urlopen(base_url + str(page))
    html = response.read()

and then scrape the HTML.
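Note that urllib2 exists only in Python 2. On Python 3 the same idea could be written with `urllib.request`. This is a rough sketch, assuming the 10 pages stated in the question; `page_urls`, `fetch_html` and `fetch_pages` are hypothetical helper names, and the `fetch` parameter is there only so the loop can be exercised without network access.

```python
from urllib.request import urlopen

BASE_URL = "http://directory.thesun.co.uk/find/uk/computer-repair"


def page_urls(last_page=10):
    """Build the listing URLs: the bare URL for page 1, then /page/2..N."""
    return [BASE_URL] + ["%s/page/%d" % (BASE_URL, n) for n in range(2, last_page + 1)]


def fetch_html(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_pages(urls, fetch=fetch_html):
    """Return a dict mapping each URL to its downloaded HTML."""
    return {url: fetch(url) for url in urls}
```

Calling `fetch_pages(page_urls())` would download all ten listing pages one after another; parsing the returned HTML is left to whichever parser you prefer.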

2013-07-14 18:48:18

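I tried the code that @nizam.sp posted, and it only returned 2 records: 1 record from the main page (the last record) and 1 record from the second page (a random record), and then it stopped.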

2013-07-15 21:33:56 Gio
