2013-07-14 95 views

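Scrapy scrapes only the first page

Hey, I'm building a project with Scrapy in which I need to scrape business details from a business directory: http://directory.thesun.co.uk/find/uk/computer-repair
The problem I'm facing is this: when I crawl the site, my spider scrapes the details from the first page only, whereas I need the details from the remaining 9 pages as well, that is, all 10 pages. My spider code, items.py and settings.py are shown below. Please take a look at my code and help me solve this problem.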

Spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2.items import Project2Item


class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk/"]
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each business listing sits in its own div with class "abTbl ".
        sites = hxs.select('//div[@class="abTbl "]')
        items = []
        for site in sites:
            # All XPaths below are relative to the current listing div.
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            items.append(item)
        return items

My items.py code is as follows:

from scrapy.item import Item, Field 

class Project2Item(Item): 
    Catogory = Field() 
    Bussiness_name = Field() 
    Description = Field() 
    Number = Field() 
    Web_url = Field() 
    adress_name = Field() 
    Photo_name = Field() 
    Photo_path = Field() 

My settings.py is:

BOT_NAME = 'project2' 

SPIDER_MODULES = ['project2.spiders'] 
NEWSPIDER_MODULE = 'project2.spiders' 

Please help me extract the details from the other pages too...

Answers


When scraping the description, .select('span/text()') selects the text of every span inside //div[@class="abTbl "]. To extract only the last span, you can use the XPath 'span[last()]/text()'.
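
To illustrate the difference between selecting every span and selecting only the last one, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree rather than Scrapy's selector (an assumption made so the snippet runs without Scrapy installed), and the markup in SAMPLE is hypothetical, not copied from the directory page. ElementTree has no text() step, so the element's .text attribute is read instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for one listing on the page.
SAMPLE = """
<html><body>
  <div class="abTbl ">
    <span class="icListBusType">Computer Repair</span>
    <span class="searchInfoLabel">Example address</span>
    <span>Example description</span>
  </div>
</body></html>
"""


def span_texts(xml_text, path):
    """Return, for each listing div, the text of the spans matched by `path`."""
    root = ET.fromstring(xml_text)
    results = []
    # Same container match as in the spider; note the trailing space in the class.
    for site in root.findall('.//div[@class="abTbl "]'):
        results.append([span.text for span in site.findall(path)])
    return results


# "span" matches every direct child span of the listing div.
all_spans = span_texts(SAMPLE, "span")
# "span[last()]" keeps only the final child span, here the description.
last_span = span_texts(SAMPLE, "span[last()]")

print(all_spans)
print(last_span)
```

In the spider itself the equivalent expression would be site.select('span[last()]/text()').extract(), as in the answer above.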

BTW, http://www.w3schools.com/xpath/xpath_syntax.asp should help you with XPath.