Scrapy deny rule is not ignoring URLs

I have some rules that I fetch dynamically from a database and add to my spider:

    self.name = exSettings['site']
    self.allowed_domains = [exSettings['root']]
    self.start_urls = ['http://' + exSettings['root']]

    self.rules = [Rule(SgmlLinkExtractor(allow=(exSettings['root'] + '$',)), follow=True)]
    denyRules = []

    for rule in exSettings['settings']:
        linkRegex = rule['link_regex']

        if rule['link_type'] == 'property_url':
            propertyRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True, callback='parseProperty')
            self.rules.insert(0, propertyRule)
            self.listingEx.append({'link_regex': linkRegex, 'extraction': rule['extraction']})

        elif rule['link_type'] == 'project_url':
            # not set to crawl yet due to conflict if same links appear for both
            projectRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True)
            self.rules.insert(0, projectRule)

        elif rule['link_type'] == 'favorable_url':
            favorableRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True)
            self.rules.append(favorableRule)

        elif rule['link_type'] == 'ignore_url':
            denyRules.append(linkRegex)

    # somehow all urls will get ignored if allow is empty and put as the first rule
    d = Rule(SgmlLinkExtractor(allow=('testingonly',), deny=tuple(denyRules)), follow=True)

    # self.rules.insert(0, d)  # I have tried both placements, with the same results
    self.rules.append(d)

I have the following rules in my database:

link_regex: /listing/\d+/.+ (property_url) 
link_regex: /project-listings/.+ (favorable_url) 
link_regex: singapore-property-listing/ (favorable_url) 
link_regex: /mrt/ (ignore_url)

I see this in my log:

http://www.propertyguru.com.sg/singapore-property-listing/property-for-sale/mrt/125/ang-mo-kio-mrt-station> (referer: http://www.propertyguru.com.sg/listing/8277630/for-sale-thomson-grand-6-star-development-)

Shouldn't /mrt/ be denied? Why is the link above still being crawled?

Asked 2012-01-09 by goh

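As far as I know, the deny argument must be in the same SgmlLinkExtractor that has the allow patterns.

In your case, you created an SgmlLinkExtractor that allows favorable_url ('singapore-property-listing/'). But this extractor has no deny patterns, so it also extracts /mrt/.

To fix this, you should add the deny patterns to the corresponding SgmlLinkExtractors. See also the related question.

Maybe there is some way to define global deny patterns, but I haven't seen one.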
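
As a minimal sketch of that fix, assuming the same exSettings['settings'] structure as in the question: collect the ignore_url patterns first, then attach them as deny to every extractor. The helper name build_extractor_kwargs is hypothetical, and it only builds the keyword arguments; each result would still need to be wrapped in Rule(SgmlLinkExtractor(...)) inside the spider.

```python
def build_extractor_kwargs(settings):
    """Return one dict of link-extractor arguments per non-ignore rule.

    Hypothetical helper: every dict carries the full tuple of ignore_url
    patterns as `deny`, so no extractor can pick up an ignored link.
    """
    # First pass: gather all deny patterns before any extractor is built.
    deny = tuple(r['link_regex'] for r in settings
                 if r['link_type'] == 'ignore_url')

    # Second pass: one allow pattern per rule, sharing the same deny tuple.
    return [
        {'allow': (r['link_regex'],), 'deny': deny}
        for r in settings
        if r['link_type'] != 'ignore_url'
    ]


# The rules listed in the question, in the assumed database format.
settings = [
    {'link_type': 'property_url', 'link_regex': r'/listing/\d+/.+'},
    {'link_type': 'favorable_url', 'link_regex': r'/project-listings/.+'},
    {'link_type': 'favorable_url', 'link_regex': r'singapore-property-listing/'},
    {'link_type': 'ignore_url', 'link_regex': r'/mrt/'},
]

for kwargs in build_extractor_kwargs(settings):
    print(kwargs)
```

In the spider, each dict could then be passed along as Rule(SgmlLinkExtractor(**kwargs), follow=True), with the callback added for property_url rules as in the question. The catch-all rule for the root URL would need the same deny tuple if it should honour the ignore list too.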

Answered 2012-01-10 14:44:58 by reclosedev

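Yes, you are right. After looking at the source code: deny simply skips the matching links within that extractor, but the skipped links are still passed to the extractors of the subsequent rules. – goh 2012-01-11 03:26:40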
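
The behaviour described in this comment can be modelled without Scrapy. The sketch below is a simplified stand-in, not Scrapy's actual code: it assumes each extractor keeps a URL when it matches an allow pattern and no deny pattern (via re.search), and that a link is crawled if any rule's extractor keeps it. The names extractor_keeps and is_crawled are hypothetical.

```python
import re


def extractor_keeps(url, allow, deny=()):
    """Simplified model of one link extractor's allow/deny filtering."""
    allowed = any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


def is_crawled(url, rules):
    """A link is followed if at least one rule's extractor keeps it."""
    return any(extractor_keeps(url, **rule) for rule in rules)


# The URL from the log in the question.
url = ('http://www.propertyguru.com.sg/singapore-property-listing/'
       'property-for-sale/mrt/125/ang-mo-kio-mrt-station')

# Rules as in the question: deny lives only in its own separate rule.
separate = [
    {'allow': (r'singapore-property-listing/',)},
    {'allow': (r'testingonly',), 'deny': (r'/mrt/',)},
]

# Rules as suggested in the answer: deny sits next to the allow pattern.
combined = [
    {'allow': (r'singapore-property-listing/',), 'deny': (r'/mrt/',)},
]

print(is_crawled(url, separate))  # True: the favorable rule keeps the link
print(is_crawled(url, combined))  # False: the same extractor now denies it
```

Under these assumptions, a deny pattern only filters links for the extractor it is attached to, which matches what the log shows.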
