I've been stuck on this for days and it's driving me crazy. Scrapy: rules set inside __init__ are ignored by CrawlSpider
I call my Scrapy spider like this:
scrapy crawl example -a follow_links="True"
I pass in the follow_links flag to determine whether the entire website should be scraped, or just the index page I define in the spider.
This flag is checked in the spider's constructor to decide which rules should be set:
    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )
If it's "True", all links are allowed; if it's "False", all links are denied.
So far, so good; however, these rules are ignored. The only way I can get rules to be followed is if I define them outside of the constructor. That means something like this works as expected:
class ExampleSpider(CrawlSpider):

    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...
So basically, defining the rules inside the __init__ constructor causes them to be ignored, whereas defining the rules outside the constructor works as expected.
I can't understand this. My code is below.
import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)

        # single page or follow links
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None

    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None
Thank you for taking the time to help me with this.
You could try setting your rules before calling `super(ExampleSpider, ...` – eLRuLL
@eLRuLL, you should post this as an answer –
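The suggestion above matches how CrawlSpider behaves: its __init__ compiles whatever self.rules holds at that moment, so rules assigned after super().__init__() returns are never seen. A minimal pure-Python sketch of the same ordering issue (BaseSpider, BrokenSpider, and FixedSpider are hypothetical stand-ins for illustration, not Scrapy classes):

```python
class BaseSpider:
    """Stand-in for CrawlSpider: it snapshots self.rules during __init__."""
    rules = ()

    def __init__(self):
        # Analogous to CrawlSpider compiling its rules at construction time:
        # whatever self.rules holds right now is what actually gets used.
        self.compiled_rules = list(self.rules)


class BrokenSpider(BaseSpider):
    """Assigns rules AFTER super().__init__() -- too late, they are ignored."""
    def __init__(self):
        super().__init__()
        self.rules = ("follow-everything",)


class FixedSpider(BaseSpider):
    """Assigns rules BEFORE super().__init__() -- they get compiled."""
    def __init__(self):
        self.rules = ("follow-everything",)
        super().__init__()


print(BrokenSpider().compiled_rules)  # [] -- the late assignment is invisible
print(FixedSpider().compiled_rules)   # ['follow-everything']
```

Applied to the spider above, that means moving the `if self.follow_links == "True": ...` block so it runs before the `super(ExampleSpider, self).__init__(*args, **kwargs)` call.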