Scrapy: crawling a simple website at specific depths

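I want to scrape a website of problems and answers that is three levels deep. It has a simple structure, shown below:

Second depth -> contains metadata (problem descriptions)

Third depth -> contains the actual data (problems and answers)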

/prob 
    +-> /prob/problemLists.html 
    +-> /prob/problem123456.html

I wrote the Scrapy code below, using response.meta['depth'] as the condition.

Is there a better way to do this?

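from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DmzSpider(CrawlSpider):
    rules = (
        # callback is an argument of Rule, not of the link extractor
        Rule(SgmlLinkExtractor(deny=(r'index\.htm',)), callback='parse_list'),
    )

    def parse_list(self, response):
        if response.meta['depth'] == 2:
            items = []  # Scrape descriptions ...
            return items
        elif response.meta['depth'] == 3:
            # Return the result so the items actually reach the pipeline
            return self.parse_item(response)

    def parse_item(self, response):
        items = []  # Parse items and save them according to prob_id ...
        return items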
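
One possible alternative to branching on response.meta['depth'] is to dispatch on the URL itself. The sketch below uses only the standard library to show the idea. The two regular expressions are assumptions derived from the example paths above (problemLists.html for descriptions, problem followed by digits for items), and pick_callback is a hypothetical helper, not part of Scrapy. In a CrawlSpider, the same patterns could be passed as allow= to two separate Rule objects, each with its own callback.

```python
import re

# Assumed URL shapes, taken from the example tree in the question.
LIST_PAGE = re.compile(r"/prob/problemLists\.html$")
ITEM_PAGE = re.compile(r"/prob/problem\d+\.html$")


def pick_callback(url):
    """Return the name of the callback that should handle `url`, or None."""
    if LIST_PAGE.search(url):
        return "parse_list"
    if ITEM_PAGE.search(url):
        return "parse_item"
    return None  # neither a list page nor an item page: skip it


if __name__ == "__main__":
    for url in ("/prob/problemLists.html",
                "/prob/problem123456.html",
                "/prob/index.htm"):
        print(url, "->", pick_callback(url))
```

Routing by URL keeps each callback focused on one page type and does not depend on how many hops the crawler needed to reach a page.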

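I also tried the three options below. None of them worked; the crawl summary still showed request_depth_max = 1:

1. Adding the following to the spider file:
   from scrapy.conf import settings
   settings.overrides['DEPTH_LIMIT'] = 2
2. Running from the command line with the -s option:
   /usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org
3. Adding the following to settings.py and scrapy.cfg:
   DEPTH_LIMIT = 2

How should it be configured to go deeper than 1?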

Source

2012-06-04 Leonard Huang

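Not sure if this is what you're looking for, but you can use the depth limit set by DepthLimitMiddleware, which is enabled by default. For its settings, see: http://doc.scrapy.org/zh/latest/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.depth

What I want is to crawl the pages at the second and third depth, and nothing deeper. I'll edit my question to make that clearer.

In this case you can set the depth limit to 3; see the settings page in the Scrapy documentation.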
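
As a minimal sketch of that suggestion, the fragment below shows the setting as it would appear in the project's settings.py, the file that scrapy startproject generates inside the project package. The value 3 follows the answer above; it is not a verified fix for the asker's crawl.

```python
# settings.py (project-level Scrapy settings)

# Maximum depth the crawler is allowed to reach.
# 0 means no limit; 3 follows the suggestion in the answer above.
DEPTH_LIMIT = 3
```

The same value can be supplied for a single run with the -s option, as in the question's second attempt. It may also be worth checking the rule itself: per the Scrapy documentation, a Rule that has a callback defaults to follow=False, so links found on pages handled by that callback are not followed unless follow=True is set. That default, rather than DEPTH_LIMIT, could be what keeps request_depth_max at 1, though the thread does not confirm this.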

Source

2012-07-24 15:09:14

Where can I find the settings file? – Sekai
