
Scrapy: crawl only internal links, including relative links

I need Scrapy to crawl all of a site's internal links, so that for example every link on www.stackovflow.com gets crawled. This code sort of works:

extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))

for link in extractor.extract_links(response):
    self.registerUrl(link.url)

However, there is a small problem: relative paths such as /meta are not crawled, because they do not contain the base domain stackoverflow.com. Any idea how to fix this?
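A minimal sketch of one way to handle those relative paths explicitly, reusing the getBase, startDomain and registerUrl helpers from the snippet above; resolving each raw href through response.urljoin and checking the host manually is an assumption on my part, not necessarily how the original spider works:

# sketch of a parse() method on the asker's spider class
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    base = self.getBase(self.startDomain)  # e.g. "stackoverflow.com"
    extractor = LinkExtractor(allow_domains=base)

    # extract_links() returns absolute URLs, resolved against response.url
    for link in extractor.extract_links(response):
        self.registerUrl(link.url)

    # additionally resolve raw hrefs, so a relative path such as "/meta"
    # becomes "https://stackoverflow.com/meta" before the domain check
    for href in response.css("a::attr(href)").extract():
        absolute = response.urljoin(href)
        # naive suffix check on the host, good enough for a sketch
        if urlparse(absolute).netloc.endswith(base):
            self.registerUrl(absolute)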

Doesn't scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware) do exactly this?

Thanks, apparently I had found some outdated documentation.

Answer


If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware), which

filters out requests for URLs outside the domains covered by the spider.

This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.

My understanding is that URLs are normalized before being filtered.
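To make the moving parts concrete, here is a minimal sketch of a spider that relies on this middleware; the spider name, start URL and parse logic are placeholders for illustration, not the asker's actual code:

import scrapy
from scrapy.linkextractors import LinkExtractor

class InternalLinksSpider(scrapy.Spider):
    name = "internal_links"
    # OffsiteMiddleware drops every request whose host is not in
    # allowed_domains (subdomains of listed domains are still allowed)
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["https://stackoverflow.com/"]

    def parse(self, response):
        # no allow_domains needed on the extractor: offsite requests,
        # including ones built from relative hrefs, are filtered by
        # the middleware before they are downloaded
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)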

Should the OffsiteMiddleware be disabled in settings.py?

Just to be sure that it works.

No; 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500, see https://doc.scrapy.org/en/latest/topics/settings.html?highlight=OffsiteMiddleware
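For reference, this middleware is part of Scrapy's default SPIDER_MIDDLEWARES_BASE, so it normally needs no extra configuration; listing it explicitly in settings.py, as the comment above suggests, would look roughly like this:

# settings.py
SPIDER_MIDDLEWARES = {
    # 500 is the default priority; setting the value to None would
    # disable the middleware instead
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
}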