How can I access command-line arguments in a CrawlSpider in Scrapy?

I want to pass an argument on the scrapy crawl ... command line and use it in the rule definitions of a spider that extends CrawlSpider, like the one below:

name = 'example.com' 
allowed_domains = ['example.com'] 
start_urls = ['http://www.example.com'] 

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php') 
    # and follow links from them (since no callback means follow=True by default). 
    Rule(SgmlLinkExtractor(allow=('category\.php',), deny=('subsection\.php',))), 

    # Extract links matching 'item.php' and parse them with the spider's method parse_item 
    Rule(SgmlLinkExtractor(allow=('item\.php',)), callback='parse_item'), 
)

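What I want is for the allow attribute of the SgmlLinkExtractor to be specified by a command-line argument. I have searched around and found that I can get the argument values in the spider's __init__ method, but how can I use a command-line argument in the rule definitions?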


2014-04-29 David

You can build the spider's rules attribute in the __init__ method, like this:

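# Import paths as used by Scrapy releases of that era (0.x)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):

    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # Use the constructor argument here; self.allow does not exist yet.
        # An empty tuple (no -a allow=... given) lets every link through.
        # Set the rules before calling the parent __init__, which compiles them.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=(allow,) if allow else ())),
        )
        super(MySpider, self).__init__(*args, **kwargs)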

And you pass the allow argument on the command line like this:

scrapy crawl example.com -a allow="item\.php"
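
Values passed with -a reach __init__ as plain strings, so passing several patterns at once needs a little parsing. The following is only a rough sketch under stated assumptions: parse_allow_patterns and would_allow are hypothetical helper names, not part of Scrapy, and would_allow merely approximates link filtering with re.search rather than reproducing Scrapy's own logic.

```python
import re


def parse_allow_patterns(raw):
    """Split a comma-separated -a value into a tuple of regex strings.

    Hypothetical helper, not part of Scrapy. Returns an empty tuple when
    no value was passed. Limitation: a pattern that itself contains a
    comma (for example a{1,3}) would be split incorrectly.
    """
    if not raw:
        return ()
    patterns = tuple(p.strip() for p in raw.split(",") if p.strip())
    for pattern in patterns:
        re.compile(pattern)  # raises re.error early on an invalid pattern
    return patterns


def would_allow(url, patterns):
    """Rough stand-in for link filtering, for quick local checks only.

    An empty pattern tuple is treated as "allow everything".
    """
    if not patterns:
        return True
    return any(re.search(p, url) for p in patterns)


if __name__ == "__main__":
    patterns = parse_allow_patterns(r"item\.php, category\.php")
    for url in ("http://www.example.com/item.php?id=1",
                "http://www.example.com/about.html"):
        print(url, would_allow(url, patterns))
```

Inside the spider's __init__, the result could then be handed to the link extractor as allow=parse_allow_patterns(allow), with the spider started as scrapy crawl example.com -a allow="item\.php,category\.php".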


2014-04-29 08:54:08
