2014-04-29 80 views

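How do I access command-line arguments in a CrawlSpider in Scrapy?

I want to pass an argument on the scrapy crawl ... command line and use it in the rule definitions of a spider that extends CrawlSpider, like the one below: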

name = 'example.com' 
allowed_domains = ['example.com'] 
start_urls = ['http://www.example.com'] 

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php') 
    # and follow links from them (since no callback means follow=True by default). 
    Rule(SgmlLinkExtractor(allow=('category\.php',), deny=('subsection\.php',))), 

    # Extract links matching 'item.php' and parse them with the spider's method parse_item 
    Rule(SgmlLinkExtractor(allow=('item\.php',)), callback='parse_item'), 
) 

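What I want is for the allow attribute of the SgmlLinkExtractor to be specified as a command-line argument. I searched around and found that I can get argument values in the spider's __init__ method, but how can I use a command-line argument inside the rule definitions?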

Answer

You can build the spider's rules attribute inside the __init__ method, like this:

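# Import paths as used by Scrapy releases that still ship SgmlLinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):

    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # `allow` arrives as a string from `-a allow=...`; with no argument,
        # an empty tuple leaves the link extractor unfiltered.
        allow_patterns = (allow,) if allow else ()
        # Set rules before calling the parent __init__, which compiles them.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=allow_patterns)),
        )
        super(MySpider, self).__init__(*args, **kwargs)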

Then pass the allow argument on the command line like this:

scrapy crawl example.com -a allow="item\.php"
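
Values passed with -a always reach __init__ as plain strings, so passing several patterns at once needs a small parsing step. Below is a minimal sketch of one way to do that, using only the standard library. The helper name parse_allow_patterns and the comma separator are assumptions made for illustration; they are not part of Scrapy.

```python
import re


def parse_allow_patterns(raw, sep=","):
    """Split a raw -a value into a tuple of regex strings (hypothetical helper).

    Returns an empty tuple when no value was given, so the result can be
    passed straight to a link extractor's `allow` parameter.
    """
    if not raw:
        return ()
    # Drop surrounding whitespace and skip empty pieces such as "a,,b".
    patterns = tuple(p.strip() for p in raw.split(sep) if p.strip())
    for pattern in patterns:
        # Fail early with re.error if a pattern is not a valid regex.
        re.compile(pattern)
    return patterns


if __name__ == "__main__":
    print(parse_allow_patterns(r"item\.php, category\.php"))
```

Inside __init__ the result could replace the single-element tuple, for example allow=parse_allow_patterns(allow). Note that a comma separator would break patterns that themselves contain commas, such as \d{1,3}, so a different separator may be a better fit for those.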