
I have set up a rule to get the next page from start_urls, but it doesn't work: it only crawls the start_urls page and the links inside that page (via parseLinks). It never follows the next page set in the rule. How do I make the Scrapy rule move on to the next page?

Any help?

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import Selector 
from scrapy import log 
from urlparse import urlparse 
from urlparse import urljoin 
from scrapy.http import Request 

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]

    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)

    def parse(self, response):
        sel = Selector(response)
        urls = sel.xpath('//div[@id="btReserve"]/../@href').extract()
        for url in urls:
            url = urljoin(response.url, url)
            self.log('URLS: %s' % url)
            yield Request(url, callback=self.parseLinks)

    def parseLinks(self, response):
        sel = Selector(response)
        titulo = sel.xpath('h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        email = sel.xpath('//a[@class="sendMail"][1]/text()')[0].extract()
        url = sel.xpath('//div[@class="contentContacto sendUrl"]/a/text()').extract()
        telefone = sel.xpath('//div[@class="telefone"]/div[@class="contentContacto"]/text()').extract()
        fax = sel.xpath('//div[@class="fax"]/div[@class="contentContacto"]/text()').extract()
        descricao = sel.xpath('//div[@id="tbDescricao"]/p/text()').extract()
        gps = sel.xpath('//td[@class="sendGps"]/@style').extract()

        print titulo, email, morada

Check this answer, it will solve the problem: http://stackoverflow.com/questions/13227546/scrapy-crawls-first-page-but-does-not-follow-links?answertab=votes#tab-top – Perefexexos

Answers

Answer (score 4)

You should not override the parse method of a CrawlSpider, otherwise the Rules will not be followed.

http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules

See the warning there: when writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.


I have changed parse to parsePage and set the rule callback to callback='parsePage', and now it does not enter def parsePage –


Try using 'restrict_xpaths=('//a[@id="seguinte"]')), callback='parsePage', follow=True),)' –


Thank you, paul, it works now –
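
Putting the answer and paul's comment together, a working spider might look like the sketch below: parse is renamed to parsePage, the rule's callback refers to it by name, and restrict_xpaths points at the <a> element itself rather than its @href attribute (the link extractor pulls the href out of the matched region on its own). This is a sketch against the question's code, not the poster's exact final spider; parseLinks stays as in the question.

from urlparse import urljoin

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/pesquisa/filtro/?tipo=0&local=0']

    # restrict_xpaths selects the <a> element, not its @href; parse()
    # is left untouched so CrawlSpider can use it to apply this rule.
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]',)),
             callback='parsePage', follow=True),
    )

    def parsePage(self, response):
        # Runs on every page reached through the rule. Note that the
        # start_urls response itself goes through parse_start_url(),
        # which can also be overridden if the first page needs scraping.
        sel = Selector(response)
        for url in sel.xpath('//div[@id="btReserve"]/../@href').extract():
            yield Request(urljoin(response.url, url), callback=self.parseLinks)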

Answer (score 1)

You are using the wrong spider class:

class MySpider(CrawlSpider) is not the proper class here; use class MySpider(Spider) instead:

class MySpider(Spider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]

In the Spider class you do not need rules, so discard this line; it is not usable in a Spider:

rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)