2016-08-08 66 views

scrapy InitSpider: setting rules in __init__?

I am building a recursive web spider with an optional login. I want to make most of its settings dynamic via a JSON config file.

In my __init__ function I read this file and try to populate all variables. However, this does not work for the Rules.

class CrawlpySpider(InitSpider):

    ...

    #----------------------------------------------------------------------
    def __init__(self, *args, **kwargs):
        """Constructor: overwrite parent __init__ function"""

        # Call parent init
        super(CrawlpySpider, self).__init__(*args, **kwargs)

        # Get configuration param provided as command line arg
        config_file = kwargs.get('config')

        # Validate configuration file parameter
        if not config_file:
            logging.error('Missing argument "-a config"')
            logging.error('Usage: scrapy crawl crawlpy -a config=/path/to/config.json')
            self.abort = True

        # Check if it is actually a file
        elif not os.path.isfile(config_file):
            logging.error('Specified config file does not exist')
            logging.error('Not found in: "' + config_file + '"')
            self.abort = True

        # All good, read config
        else:
            # Load json config
            with open(config_file) as fpointer:
                data = fpointer.read()

            # Convert JSON to dict
            config = json.loads(data)

            # config['rules'] is simply a string array which looks like this:
            # config['rules'] = [
            #     'password',
            #     'reset',
            #     'delete',
            #     'disable',
            #     'drop',
            #     'logout',
            # ]

            CrawlpySpider.rules = (
                Rule(
                    LinkExtractor(
                        allow_domains=(self.allowed_domains),
                        unique=True,
                        deny=tuple(config['rules'])
                    ),
                    callback='parse',
                    follow=False
                ),
            )

Scrapy still crawls the pages that are listed in config['rules'] and therefore also hits the logout page. So the specified pages are not being denied. What am I missing here?
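As a sanity check on the deny list itself: LinkExtractor's deny entries are regular expressions that are searched for in the full URL, so plain words like 'logout' effectively act as substring matches. This stdlib-only sketch (is_denied is a hypothetical helper, not part of Scrapy) shows how those patterns are expected to match:

```python
import re

def is_denied(url, deny_patterns):
    """Approximate LinkExtractor's deny behaviour: each entry is a
    regex searched for anywhere in the URL."""
    return any(re.search(p, url) for p in deny_patterns)

deny = ['password', 'reset', 'delete', 'disable', 'drop', 'logout']
print(is_denied('http://example.com/logout', deny))   # True
print(is_denied('http://example.com/profile', deny))  # False
```

If this matches but the spider still follows the links, the problem lies in how the rules are wired into the spider, not in the patterns.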

Update:

I have already tried setting CrawlpySpider.rules = ... as well as self.rules = ... inside __init__. Neither variant works.

  • Spider: InitSpider
  • Rules: LinkExtractor
  • Before crawling: log in first, then crawl
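The timing matters here. In CrawlSpider, __init__ calls _compile_rules(), so rules assigned after super().__init__() returns are never compiled. This pure-Python analogue (no Scrapy; the class names are made up for illustration) shows the effect:

```python
class Base:
    rules = ()
    def __init__(self):
        # stands in for CrawlSpider._compile_rules()
        self._compiled = list(self.rules)

class TooLate(Base):
    def __init__(self):
        super().__init__()             # compiles the (still empty) rules
        self.rules = ('deny-logout',)  # too late: never compiled

class InTime(Base):
    def __init__(self):
        self.rules = ('deny-logout',)  # set before compilation
        super().__init__()

print(TooLate()._compiled)  # []
print(InTime()._compiled)   # ['deny-logout']
```

Note this analogue only describes CrawlSpider; as discussed in the comments below, InitSpider never compiles rules at all.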

I even tried to deny them in my parse function:

# Dive deeper?
        # The nesting depth is now handled via a custom middleware (middlewares.py)
        #if curr_depth < self.max_depth or self.max_depth == 0:
        links = LinkExtractor().extract_links(response)
        for link in links:
            # Only follow links whose URL contains none of the ignore words
            if not any(ignore.lower() in link.url.lower() for ignore in self.ignores):
                yield Request(link.url, meta={'depth': curr_depth+1, 'referer': response.url})
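The same substring filtering can be exercised without Scrapy. This is a stdlib-only sketch (filter_links is a hypothetical helper): one case-insensitive check per URL, returning each allowed URL exactly once.

```python
def filter_links(urls, ignores):
    """Keep only URLs containing none of the ignore words
    (case-insensitive)."""
    ignores = [i.lower() for i in ignores]
    return [u for u in urls if not any(i in u.lower() for i in ignores)]

urls = ['http://example.com/home',
        'http://example.com/logout',
        'http://example.com/Reset']
print(filter_links(urls, ['logout', 'reset']))  # ['http://example.com/home']
```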

Answer


You are setting a class attribute where you want to set an instance attribute:

# this: 
CrawlpySpider.rules = (
# should be this: 
self.rules = (
<...> 

I tried that as well and it did not work. I have added that information to my question. – cytopia


InitSpider does not seem to have anything like '_compile_rules' (the way CrawlSpider does). Apparently InitSpider does not even have rules at all, since it inherits directly from Spider. CrawlSpider does implement all of that. – cytopia