
I have built a scrapy-redis crawler and decided to turn it into a distributed crawler. Furthermore, I want to make it task-based, with one name per task. So I plan to change the spider's name to the task's name and use that name to distinguish each task. While running it from a web admin, I ran into the problem of how to change the spider's name. Is there any way to change a scrapy spider's name from a script?

Here is my code (it is still rough):

#-*- encoding: utf-8 -*- 
import redis 
from scrapy.crawler import CrawlerProcess 
from scrapy.utils.project import get_project_settings 
from scrapy_redis.spiders import RedisSpider 
import pymongo 
client = pymongo.MongoClient('mongodb://localhost:27017') 
db_name = 'news' 
db = client[db_name] 

class NewsSpider(RedisSpider): 
    """Spider that reads urls from redis queue (myspider:start_urls).""" 
    name = 'news' 
    redis_key = 'news:start_urls' 
    start_urls = ["http://www.bbc.com/news"] 

    def parse(self, response):
        pass

    # I added these: setname and getname
    def setname(self, name):
        self.name = name

    def getname(self):
        return self.name

def start(): 
    news_spider = NewsSpider() 
    news_spider.setname('test_spider_name') 
    print news_spider.getname() 
    r = redis.Redis(host='127.0.0.1', port=6379, db=0) 
    r.lpush('news:start_urls', 'http://news.sohu.com/') 
    process = CrawlerProcess(get_project_settings()) 
    process.crawl('test_spider_name') 
    process.start() # the script will block here until the crawling is finished 

if __name__ == '__main__': 
    start() 

And there is this error:

test_spider_name 
2017-05-26 20:14:05 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot) 
2017-05-26 20:14:05 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'geospider.spiders', 'SPIDER_MODULES': ['geospider.spiders'], 'COOKIES_ENABLED': False, 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter'} 
Traceback (most recent call last): 
    File "/home/kui/work/python/project/bigcrawler/geospider/control/command.py", line 29, in <module> 
    start() 
    File "/home/kui/work/python/project/bigcrawler/geospider/control/command.py", line 23, in start 
    process.crawl('test_spider_name') 
    File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/crawler.py", line 162, in crawl 
    crawler = self.create_crawler(crawler_or_spidercls) 
    File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/crawler.py", line 190, in create_crawler 
    return self._create_crawler(crawler_or_spidercls) 
    File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/crawler.py", line 194, in _create_crawler 
    spidercls = self.spider_loader.load(spidercls) 
    File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/spiderloader.py", line 55, in load 
    raise KeyError("Spider not found: {}".format(spider_name)) 
KeyError: 'Spider not found: test_spider_name' 

I know this is a clumsy approach. I have been searching the net for a long time, but found nothing useful. Please help me, or suggest some ideas on how to achieve this.

Thanks in advance.


Thank you, but it did not work. – haomao

Answer


This might help:

class NewsSpider(RedisSpider): 
    """Spider that reads urls from redis queue (myspider:start_urls).""" 
    name = 'news_redis' 
    redis_key = 'news:start_urls' 
    start_urls = ["http://www.bbc.com/news"] 

    def parse(self, response):
        pass

def start(): 
    news_spider = NewsSpider() 

    # Set name & redis_key for NewsSpider 
    NewsSpider.name = 'test_spider_name_redis' 
    NewsSpider.redis_key = NewsSpider.name + ':start_urls' 

    r = redis.Redis(host='127.0.0.1', port=6379, db=0) 
    r.lpush(NewsSpider.name + ':start_urls', 'http://news.sohu.com/') 
    process = CrawlerProcess(get_project_settings()) 
    process.crawl(NewsSpider) 
    process.start() # the script will block here until the crawling is finished 

if __name__ == '__main__': 
    start()
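
The answer works because process.crawl() accepts either a spider name (a string, which Scrapy's SpiderLoader looks up among the classes registered in SPIDER_MODULES, and that lookup is exactly what raised the KeyError above) or a spider class, in which case no lookup happens at all. If you really want one spider name per task, a minimal sketch of that idea, building on the code above, is to create a renamed subclass per task with type(). The helper names make_task_spider and start_task are my own, not part of Scrapy or scrapy-redis:

# Sketch only: per-task spider classes built at runtime with type().
# Assumes NewsSpider and the question's imports (redis, CrawlerProcess,
# get_project_settings) are already in scope.
def make_task_spider(task_name):
    # Build a new subclass of NewsSpider whose name and redis_key
    # are derived from the task name.
    return type(task_name, (NewsSpider,), {
        'name': task_name,
        'redis_key': task_name + ':start_urls',
    })

def start_task(task_name, seed_url):
    spider_cls = make_task_spider(task_name)
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    r.lpush(spider_cls.redis_key, seed_url)    # seed this task's queue
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)                  # pass the class, not a name string
    process.start()

if __name__ == '__main__':
    start_task('task_sohu_news', 'http://news.sohu.com/')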