
I noticed that a CrawlSpider `Rule` extracts URLs on every non-leaf page.
Can a rule be enabled only when the current page meets some condition (for example, its URL matches a regex)? In other words, how do I make CrawlSpider rules context-sensitive?

I have two pages:


-------------------Page A------------------- 
Page URL: http://www.site.com/pattern-match.html 
-------------------------------------------- 

- [link](http://should-extract-this) 
- [link](http://should-extract-this) 
- [link](http://should-extract-this) 

-------------------------------------------- 

--------------------Page B-------------------- 
Page URL: http://www.site.com/pattern-not-match.html 
----------------------------------------------- 

- [link](http://should-not-extract-this) 
- [link](http://should-not-extract-this) 
- [link](http://should-not-extract-this) 

----------------------------------------------- 

So the rule should only extract URLs from Page A. How can I do that? Thanks!
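
For reference, here is roughly what my current spider looks like; just a minimal sketch assuming the old `scrapy.contrib` API (the spider name, domain and XPath are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

class SiteSpider(CrawlSpider): 

    name = 'site' 
    allowed_domains = ['site.com'] 
    start_urls = ['http://www.site.com/'] 

    # `allow`/`deny` on the link extractor filter the *extracted* URLs, 
    # not the page the links were found on, so this rule cannot tell 
    # links on pattern-match.html apart from links on pattern-not-match.html 
    rules = [ 
        Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item'), 
    ] 

    def parse_item(self, response): 
        print response.url 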


The question is not clear. Are you looking for a specific rule pattern? `Rule(SgmlLinkExtractor(allow=('pattern-match',), deny=('pattern-not-match',)))` – agstudy


@agstudy I'm looking for a concise way to make `Rule` context-aware (so the `SgmlLinkExtractor` knows which page it is currently extracting URLs from). – kev

Answer


I just found a dirty way to inject the `response` into the `rule`:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

from scrapy.http import Request, HtmlResponse 
from scrapy.contrib.spiders import CrawlSpider, Rule 

import inspect 

class MyCrawlSpider(CrawlSpider): 

    def _requests_to_follow(self, response): 
        if not isinstance(response, HtmlResponse): 
            return 
        seen = set() 
        for n, rule in enumerate(self._rules): 
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen] 
            if links and rule.process_links: 
                links = rule.process_links(links) 
            seen = seen.union(links) 
            for link in links: 
                r = Request(url=link.url, callback=self._response_downloaded) 
                r.meta.update(rule=n, link_text=link.text) 

                # ***>>> HACK <<<*** 
                # pass `response` as additional argument to `process_request` 
                fun = rule.process_request 
                if not hasattr(fun, 'nargs'): 
                    fun.nargs = len(inspect.getargs(fun.func_code).args) 
                if fun.nargs == 1: 
                    yield fun(r) 
                elif fun.nargs == 2: 
                    yield fun(r, response) 
                else: 
                    raise Exception('too many arguments') 

Try it out:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

def process_request(request, response): 
    # `response` is the page the link was extracted from 
    if 'magick' in response.url: 
        return request 

class TestSpider(MyCrawlSpider): 

    name = 'test' 
    allowed_domains = ['test.com'] 
    start_urls = ['http://www.test.com'] 

    rules = [ 
        Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item', 
             process_request=process_request), 
    ] 

    def parse_item(self, response): 
        print response.url 
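
Applied to the Page A / Page B example above, the same hook could look like this (just a sketch reusing the `MyCrawlSpider` base; `'pattern-match'` is the substring from the question):

def process_request(request, response): 
    # `response` is the page the link was found on, 
    # e.g. http://www.site.com/pattern-match.html; 
    # returning None drops the request 
    if 'pattern-match' in response.url: 
        return request 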