Scrapy CrawlSpider rules with multiple callbacks

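I am creating an ExampleSpider that extends Scrapy's CrawlSpider. My ExampleSpider should be able to process pages that contain only artist information, pages that contain only album information, and some other pages that contain both album and artist information.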

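I can handle the first two cases; the problem is the third one. I am using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data. My question is: if a page contains both artist and album data, how should I define my rules?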

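Should I do it as shown below? (Two rules with the same URL pattern.)
Should I use multiple callbacks? (Does Scrapy support multiple callbacks?)

Is there another way to do this? (A proper way.)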

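# Import paths as of Scrapy 0.2x (2014). Later releases moved these to
# scrapy.spiders and replaced SgmlLinkExtractor with
# scrapy.linkextractors.LinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# ArtistItem and AlbumItem are assumed to be Item subclasses defined
# elsewhere in the project (not shown).


class ExampleSpider(CrawlSpider):
    name = 'example'

    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item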


2014-05-16 Grainier

CrawlSpider calls its _requests_to_follow() method to extract URLs and generate the requests to follow:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)

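As you can see:

The variable seen remembers the links that earlier rules have already extracted from this response.
Each URL is therefore claimed by the first rule that matches it, so it is parsed by at most one callback.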
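
To make the effect of seen concrete, here is a minimal stand-alone sketch of the same bookkeeping. It does not use Scrapy: the helper assign_links_to_rules, the plain regex patterns standing in for link extractors, and the /music/ URLs are all assumptions made for illustration.

```python
import re


def assign_links_to_rules(links, rule_patterns):
    """Mimic the `seen` bookkeeping: a link goes to the first matching rule only."""
    seen = set()
    assigned = {}
    for n, pattern in enumerate(rule_patterns):
        # Skip links that an earlier rule has already claimed.
        matched = [l for l in links if re.search(pattern, l) and l not in seen]
        seen = seen.union(matched)
        for link in matched:
            assigned[link] = n  # index of the rule that will handle this link
    return assigned


links = [
    "http://www.example.com/music/1",
    "http://www.example.com/music/2",
]
# Two rules with the same pattern, as in the question.
result = assign_links_to_rules(links, [r"/music/\d+", r"/music/\d+"])
print(result)
```

With two identical patterns, every link ends up assigned to rule 0, which is why the second rule's callback in the question would never be reached for those pages.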

You can define a parse_item() that calls both parse_artist() and parse_album():

rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    # Run both parsers on every page matched by the rule above.
    yield self.parse_artist(response)
    yield self.parse_album(response)
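
A possible variation, sketched here without Scrapy so it can be run on its own: let each parser return None when its data is missing, and have parse_item() yield only what was found. The ExampleParser class, the dict used in place of a response, and the "artist"/"album" keys are stand-ins invented for this sketch, not part of the question's code.

```python
class ArtistItem(dict):
    """Stand-in for the question's ArtistItem."""


class AlbumItem(dict):
    """Stand-in for the question's AlbumItem."""


class ExampleParser:
    """Holds only the parsing methods; a plain dict plays the role of the response."""

    def parse_artist(self, response):
        if "artist" not in response:
            return None  # nothing to scrape on this page
        return ArtistItem(name=response["artist"])

    def parse_album(self, response):
        if "album" not in response:
            return None
        return AlbumItem(title=response["album"])

    def parse_item(self, response):
        # Try every parser and yield only the items that were actually found.
        for parse in (self.parse_artist, self.parse_album):
            item = parse(response)
            if item is not None:
                yield item


parser = ExampleParser()
both = list(parser.parse_item({"artist": "Some Artist", "album": "Some Album"}))
album_only = list(parser.parse_item({"album": "Some Album"}))
print(both)
print(album_only)
```

Written this way, one rule and one callback could cover all three page types, at the cost of each parser having to detect whether its data is present.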


2014-05-16 13:43:00 kev

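'parse_artist(response)' returns an 'ArtistItem()' and 'parse_album(response)' returns an 'AlbumItem()'. So if I am using an item pipeline (say, one that persists items in order), will it (the persistence pipeline) be called for both types of data? – Grainier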

@GrainierPerera The pipeline will process each item one by one. – kev
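
As a rough illustration of that comment, a pipeline that receives both item types could branch on the item's class. The process_item(item, spider) method name and signature follow Scrapy's item pipeline interface; the PersistencePipeline class, its in-memory lists (standing in for real storage), and the item stand-ins are assumptions made for this sketch.

```python
class ArtistItem(dict):
    """Stand-in for the question's ArtistItem."""


class AlbumItem(dict):
    """Stand-in for the question's AlbumItem."""


class PersistencePipeline:
    """Hypothetical pipeline that stores each item type separately."""

    def __init__(self):
        self.artists = []
        self.albums = []

    def process_item(self, item, spider):
        # Called once per item; decide what to do based on the item's type.
        if isinstance(item, ArtistItem):
            self.artists.append(item)
        elif isinstance(item, AlbumItem):
            self.albums.append(item)
        return item  # hand the item on to any later pipeline stage


pipeline = PersistencePipeline()
# Feed the items one at a time, the way they would arrive from parse_item().
for item in (ArtistItem(name="Some Artist"), AlbumItem(title="Some Album")):
    pipeline.process_item(item, spider=None)
print(pipeline.artists, pipeline.albums)
```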
