
Creating editable CrawlSpider rules in Scrapy

I have been trying to create a simple Scrapy CrawlSpider script that can be changed easily, but I can't figure out how to get the link extractor rules to work properly.

Here is my code:

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LernaSpider(CrawlSpider):
    """Our ad-hoc spider"""

    name = "lerna"

    def __init__(self, url, allow_follow='.*', deny_follow='', allow_extraction='.*', deny_extraction=''):
        parsed_url = urlparse(url)
        domain = str(parsed_url.netloc)
        self.allowed_domains = [domain]
        self.start_urls = [url]
        self.rules = (
            # Extract links and follow them
            # (no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(allow_follow,), deny=(deny_follow,))),

            # Extract links and parse them with the spider's method parse_item.
            Rule(SgmlLinkExtractor(allow=(allow_extraction,), deny=(deny_extraction,)), callback='parse_item'),
        )

        super(LernaSpider, self).__init__()

    def parse_item(self, response):
        print 'Crawling... %s' % response.url
        # more stuff here

I have this code, but I have never gotten the allow/deny rules to work properly, and I honestly don't know why. Does leaving an empty string make it deny everything? I assumed that, since these are REs, entering '.*' or whatever would just act as a blanket negative.

Any help would be appreciated.

Answer


Are you instantiating the spider yourself? Something like this:

spider = LernaSpider('http://example.com') 

Otherwise, if you are running $ scrapy crawl lerna from the command line, then you are using the URL as the first argument in the constructor incorrectly (it should be the name), and you are also not passing it on to the super class. Maybe try this:

class LernaSpider(CrawlSpider):
    """Our ad-hoc spider"""

    name = "lerna"

    def __init__(self, name=None, url=None, allow_follow='.*', deny_follow='', allow_extraction='.*', deny_extraction='', **kw):
        parsed_url = urlparse(url)
        domain = str(parsed_url.netloc)
        self.allowed_domains = [domain]
        self.start_urls = [url]
        self.rules = (
            # Extract links and follow them
            # (no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=allow_follow, deny=deny_follow)),

            # Extract links and parse them with the spider's method parse_item.
            Rule(SgmlLinkExtractor(allow=allow_extraction, deny=deny_extraction), callback='parse_item'),
        )
        super(LernaSpider, self).__init__(name, **kw)

    def parse_item(self, response):
        print 'Crawling... %s' % response.url
        # more stuff here

The regex stuff looks fine: empty values allow everything and deny nothing.
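
Two plain-Python details are easy to trip over here (a standalone illustration, independent of Scrapy itself): parentheses alone do not make a tuple, and an empty regex pattern matches every string, so an empty string that ends up compiled as a deny pattern would reject every URL.

import re

# Parentheses alone do not create a tuple; the trailing comma does.
print ('') == ''    # True: ('') is just the string ''
print ('',) == ''   # False: ('',) is a one-element tuple

# An empty pattern matches at position 0 of any string:
print bool(re.search('', 'http://example.com/page'))  # True

# So deny=('',) hands the extractor a compiled empty pattern that matches
# -- and therefore denies -- every extracted link.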


Yes, I'm instantiating the spider myself in a script. The code is something like: crawler = CrawlerProcess(settings); spider = LernaSpider(url); crawler.crawl(spider). There's more to it than that, obviously, but that's the short version. – oiez 2013-03-22 23:11:40
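
For reference, a minimal sketch of the same script-driven crawl against the current Scrapy API (1.0 and later), where CrawlerProcess.crawl takes the spider class plus constructor keyword arguments rather than a pre-built instance; the URL and settings here are placeholders:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
# Pass the spider class and its constructor kwargs, not an instance.
process.crawl(LernaSpider, url='http://example.com')
process.start()  # blocks until the crawl finishes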


I just tried changing allow=(allow_extraction,) to allow=allow_extraction and it worked! Not 100% sure why, but thanks for giving me something to work with. :) – oiez 2013-03-22 23:30:09


@steven almeroth, can I change the rules after the crawl has started? Something like 'SpiderName.rules = new_rules'? – wolfgang 2015-08-13 09:05:09
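
A minimal sketch of why plain reassignment is not enough: CrawlSpider copies rules into an internal compiled form during __init__, so a new value only takes effect if that compile step is re-run. _compile_rules is a private Scrapy method and an implementation detail, so treat this as a fragile workaround rather than a supported API:

# Assumption: spider is a running CrawlSpider instance and new_rules is a
# tuple of Rule objects. Requests already scheduled keep their old rules.
spider.rules = new_rules
spider._compile_rules()  # rebuild the compiled copy so the change takes effect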