What is the purpose of setting follow to True in a CrawlSpider's LinkExtractor?

I saw this CrawlSpider example in the Scrapy documentation:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # A plain dict is used as the item: a bare scrapy.Item() declares no
        # fields, so assigning item['id'] on it would raise KeyError.
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

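As I understand it, the following steps happen:

The Scrapy spider above (MySpider) gets a response from the Scrapy Engine for the 'http://www.example.com' link (which is in the start_urls list). The LinkExtractors then extract the links in that response according to the two rules above.
Now suppose the second LinkExtractor (the one with a callback) finds 3 links ('http://www.example.com/item1.php', 'http://www.example.com/item2.php', 'http://www.example.com/item3.php'), while the first LinkExtractor, which has no callback, finds 1 link (www.example.com/category1.php).

For the 3 links found above, the specified callback parse_item is simply called. But what happens with the one remaining link (www.example.com/category1.php), since no callback is associated with it? Do the two LinkExtractors run again on the response for that link? Is this assumption correct?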

2017-04-19 CapturedTree

# Extract links matching 'category.php' (but not matching 'subsection.php') 
# and follow links from them (since no callback means follow=True by default).

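Because your Rule object has no callback argument, its follow parameter defaults to True.
So in your example, that 1 link will be crawled and links will be extracted from its response, just as was done for the first page. This continues until the first rule extracts no more links or all links have already been visited.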
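The defaulting behaviour described above can be modelled in a few lines. This is a plain-Python sketch rather than Scrapy's own code: `effective_follow` is a hypothetical helper name, and it assumes that `follow` stays unset (`None`) unless it is passed explicitly.

```python
def effective_follow(callback=None, follow=None):
    """Illustrative model of how a Rule decides whether to follow links."""
    if follow is not None:
        # An explicit follow=True/False always wins.
        return bool(follow)
    # Otherwise: follow only when no callback was given.
    return not callback


# The two rules from the question, plus an explicit override.
category_rule = effective_follow()                       # no callback
item_rule = effective_follow(callback='parse_item')      # callback only
both = effective_follow(callback='parse_item', follow=True)
```

Under this model the category rule follows links, the item rule does not, and passing `follow=True` together with a callback gives both parsing and following.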

2017-04-19 06:39:49 Granitosaurus

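Oh OK, I see now. So basically both 'LinkExtractors' will again extract the matching links from the response produced by that link? And when you set 'follow=True', is there still any point in having a callback? – CapturedTree

No, you don't need to provide a callback in order to follow links, since you don't want to parse those pages manually. Think of it this way: 'follow=True' means the response is passed to a _hidden_ callback that just applies all of the rules to it and does nothing else. – Granitosaurus

You stated 'So in your example, that 1 link will be crawled and links will be extracted from its response.' When you say a link will be crawled, do you basically mean that the matching links will be extracted from it according to the LinkExtractors? – CapturedTree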
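The "hidden callback" idea from the comments can be illustrated with a toy simulation that needs no network and no Scrapy. Everything here is an assumption made for illustration: the `SITE` dictionary, the `RULES` tuples, and the `crawl` function are hypothetical, and the real CrawlSpider does more (request scheduling, duplicate filtering, domain filtering). The URLs use a `?id=` query string so that they match the patterns literally; a URL such as `item1.php` would not match `item\.php`.

```python
import re
from collections import deque

# Hypothetical site: URL -> links found on that page (made up for illustration).
ROOT = "http://www.example.com"
SITE = {
    ROOT: [
        ROOT + "/item.php?id=1",
        ROOT + "/item.php?id=2",
        ROOT + "/item.php?id=3",
        ROOT + "/category.php?id=1",
    ],
    ROOT + "/category.php?id=1": [
        ROOT + "/item.php?id=4",
        ROOT + "/about.html",  # matched by no rule, so never requested
    ],
}

# (allow pattern, deny pattern or None, callback name or None)
RULES = [
    (r"category\.php", r"subsection\.php", None),
    (r"item\.php", None, "parse_item"),
]


def crawl(start_url):
    """Return (pages sent to the callback, pages whose links were followed)."""
    parsed, followed = [], []
    seen = {start_url}
    # Each entry is (url, callback, follow); the start page is always followed.
    queue = deque([(start_url, None, True)])
    while queue:
        url, callback, follow = queue.popleft()
        if callback:
            parsed.append(url)  # stands in for calling parse_item(response)
        if not follow:
            continue
        followed.append(url)
        # The "hidden callback": apply every rule to this page's links.
        for allow, deny, rule_callback in RULES:
            for link in SITE.get(url, []):
                if link in seen or not re.search(allow, link):
                    continue
                if deny and re.search(deny, link):
                    continue
                seen.add(link)
                # No callback on the rule means follow defaults to True.
                queue.append((link, rule_callback, rule_callback is None))
    return parsed, followed


parsed, followed = crawl(ROOT)
```

On this toy site, `followed` contains the start page and the category page, and `parsed` contains the four item pages, including the one that is only reachable through the category page. Both rules are applied to the category page's response, which is what "crawled" means in the answer.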
