2014-02-14

Scrapy - unable to write to the log in a spider's __init__ method

I'm trying to write to the log from my spider's __init__ method, but I can't seem to get it to work, even though it works fine from the parse method.

The call to self.log in the __init__ method happens via the method 'get_urls_from_file'. I know that method is being called, because I see its print statement on stdout, so I was wondering whether someone could point me in the right direction. I'm using Scrapy v0.18. Thanks!

My code is below:

from scrapy.spider import BaseSpider
from scrapy_redis import connection
from importlib import import_module
from scrapy import log
from scrapy.settings import CrawlerSettings

class StressS(BaseSpider):
    name = 'stress_s_spider'
    allowed_domains = ['www.example.com']

    def __init__(self, url_file=None, *args, **kwargs):
        super(StressS, self).__init__(*args, **kwargs)
        settings = CrawlerSettings(import_module('stress_test.settings'))
        if url_file:
            self.url_file = url_file
        else:
            self.url_file = settings.get('URL_FILE')
        self.start_urls = self.get_urls_from_file(self.url_file)
        self.server = connection.from_settings(settings)
        self.count_key = settings.get('ITEM_COUNT')

    def parse(self, response):
        self.log('Processed: %s, status code: %s' % (response.url, response.status),
                 level=log.INFO)
        self.server.incr(self.count_key)

    def get_urls_from_file(self, fn):
        urls = []
        if fn:
            try:
                with open(fn, 'r') as f:
                    urls = [line.strip() for line in f]
            except IOError:
                msg = 'File %s could not be opened' % fn
                print msg
                self.log(msg, level=log.ERROR)
        return urls

Where do you use 'self.log' in your '__init__' method? –


Just edited the question to reflect this - within __init__, I call self.log in the get_urls_from_file method. – user2871292

Answers


You can override the start_requests method instead:

    # Default value for the argument in case it's missing.
    url_file = None

    def start_requests(self):
        settings = self.crawler.settings
        url_file = self.url_file if self.url_file else settings['URL_FILE']
        # set up server and count_key ...
        # finally yield the requests
        # (Request comes from scrapy.http)
        for url in self.get_urls_from_file(url_file):
            yield Request(url, dont_filter=True)
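To see why deferring the work helps, here is a framework-free sketch of the same pattern (plain Python, no Scrapy; the class and file handling are illustrative, not taken from the thread): the URL file is only read when start_requests runs, which in Scrapy happens after the spider has been fully set up.

```python
import os
import tempfile

# Minimal sketch of deferring work out of __init__: URLs are read
# only when start_requests is called, not at construction time.
class LazySpider(object):
    url_file = None  # class-level default, as in the answer above

    def __init__(self, url_file=None):
        if url_file:
            self.url_file = url_file

    def get_urls_from_file(self, fn):
        with open(fn) as f:
            return [line.strip() for line in f if line.strip()]

    def start_requests(self):
        # In Scrapy this would yield Request objects; here we just
        # yield the raw URLs to keep the sketch self-contained.
        for url in self.get_urls_from_file(self.url_file):
            yield url

# Demonstration with a temporary URL file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as f:
    f.write('http://example.com/a\nhttp://example.com/b\n')
spider = LazySpider(url_file=path)
print(list(spider.start_requests()))  # both URLs, read lazily
os.remove(path)
```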

You can also override the set_crawler method and set the attributes there:

    def set_crawler(self, crawler):
        super(MySpider, self).set_crawler(crawler)
        settings = crawler.settings
        # set up start_urls ...
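The ordering that makes this work can be shown with a plain-Python sketch (nothing here imports Scrapy; the class names only mirror it): the spider is constructed first, and the crawler is attached afterwards via set_crawler, so crawler-backed facilities such as settings and logging are simply not there yet inside __init__.

```python
# Stand-in for the real crawler object; the settings value is made up.
class FakeCrawler(object):
    settings = {'URL_FILE': 'urls.txt'}

class SketchSpider(object):
    def __init__(self):
        # Too early: no crawler has been attached at construction time.
        self.crawler = None

    def set_crawler(self, crawler):
        # Called by the framework after the spider is constructed.
        self.crawler = crawler

    def url_file_from_settings(self):
        # Safe only once set_crawler has run.
        return self.crawler.settings['URL_FILE']

spider = SketchSpider()
assert spider.crawler is None  # in __init__, a crawler-based log would fail
spider.set_crawler(FakeCrawler())
print(spider.url_file_from_settings())  # now safe: prints urls.txt
```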

This seems like a reasonable workaround, especially given the set_crawler method for similar needs. Out of curiosity, do you know why writing to the log in the __init__ method doesn't work? Is it because the log hasn't been initialized for writing at the time it's called? – user2871292


@user2871292, the spider is instantiated very early, so you cannot access many objects that have not been set up yet, such as self.crawler. – Rolando


As of Scrapy 0.22, it does not appear to be possible.
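One aside not raised in the thread: Python's standard-library logging module does not depend on the crawler object at all, so it can be used even inside __init__. A minimal sketch (the logger name is illustrative):

```python
import logging

# A stdlib logger works at construction time because it needs no
# crawler; Scrapy's own self.log does not, for the reasons above.
logger = logging.getLogger('stress_s_spider')

class EarlySpider(object):
    def __init__(self, url_file=None):
        self.url_file = url_file
        # Safe even before the crawler exists.
        logger.info('spider initialising with url_file=%r', url_file)

spider = EarlySpider('urls.txt')
```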