
I noticed that a CrawlSpider `Rule` extracts URLs on every non-leaf page.
Can a rule be enabled only when the current page meets some condition (for example, its URL matches a regex)? In other words, how do I make CrawlSpider rules context-sensitive?

I have two pages:


-------------------Page A------------------- 
Page URL: http://www.site.com/pattern-match.html 
-------------------------------------------- 

- [link](http://should-extract-this) 
- [link](http://should-extract-this) 
- [link](http://should-extract-this) 

-------------------------------------------- 

--------------------Page B-------------------- 
Page URL: http://www.site.com/pattern-not-match.html 
----------------------------------------------- 

- [link](http://should-not-extract-this) 
- [link](http://should-not-extract-this) 
- [link](http://should-not-extract-this) 

----------------------------------------------- 

So the rule should only extract URLs from Page A. How can I do that? Thanks!
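
For reference, here is roughly what my current spider looks like; just a minimal sketch assuming the old `scrapy.contrib` API (the spider name, domain and XPath are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

class SiteSpider(CrawlSpider): 

    name = 'site' 
    allowed_domains = ['site.com'] 
    start_urls = ['http://www.site.com/'] 

    # `allow`/`deny` on the link extractor filter the *extracted* URLs, 
    # not the page the links were found on, so this rule cannot tell 
    # links on pattern-match.html apart from links on pattern-not-match.html 
    rules = [ 
        Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item'), 
    ] 

    def parse_item(self, response): 
        print response.url 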


The question is not clear. Are you looking for a specific rule pattern? `Rule(SgmlLinkExtractor(allow=('pattern-match',), deny=('pattern-not-match',)))` – agstudy


@agstudy I'm looking for a concise way to make `Rule` context-aware (so the `SgmlLinkExtractor` knows which page it is currently extracting URLs from). – kev

Answer


I just found a dirty way to inject the `response` into the `rule`:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

from scrapy.http import Request, HtmlResponse 
from scrapy.contrib.spiders import CrawlSpider, Rule 

import inspect 

class MyCrawlSpider(CrawlSpider): 

    def _requests_to_follow(self, response): 
        if not isinstance(response, HtmlResponse): 
            return 
        seen = set() 
        for n, rule in enumerate(self._rules): 
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen] 
            if links and rule.process_links: 
                links = rule.process_links(links) 
            seen = seen.union(links) 
            for link in links: 
                r = Request(url=link.url, callback=self._response_downloaded) 
                r.meta.update(rule=n, link_text=link.text) 

                # ***>>> HACK <<<*** 
                # pass `response` as additional argument to `process_request` 
                fun = rule.process_request 
                if not hasattr(fun, 'nargs'): 
                    fun.nargs = len(inspect.getargs(fun.func_code).args) 
                if fun.nargs == 1: 
                    yield fun(r) 
                elif fun.nargs == 2: 
                    yield fun(r, response) 
                else: 
                    raise Exception('too many arguments') 

Try it out:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

def process_request(request, response): 
    # `response` is the page the link was extracted from 
    if 'magick' in response.url: 
        return request 

class TestSpider(MyCrawlSpider): 

    name = 'test' 
    allowed_domains = ['test.com'] 
    start_urls = ['http://www.test.com'] 

    rules = [ 
        Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item', 
             process_request=process_request), 
    ] 

    def parse_item(self, response): 
        print response.url 
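
Applied to the Page A / Page B example above, the same hook could look like this (just a sketch reusing the `MyCrawlSpider` base; `'pattern-match'` is the substring from the question):

def process_request(request, response): 
    # `response` is the page the link was found on, 
    # e.g. http://www.site.com/pattern-match.html; 
    # returning None drops the request 
    if 'pattern-match' in response.url: 
        return request 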