
Scrapy: crawl only internal links, including relative links

I need Scrapy to crawl all of a site's internal links, so that for example every link on www.stackovflow.com gets crawled. This code sort of works:

extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))

for link in extractor.extract_links(response):
    self.registerUrl(link.url)

However, there is a small problem: relative paths such as /meta are not crawled, because they do not contain the base domain stackoverflow.com. Any idea how to fix this?
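A minimal sketch of one way to handle those relative paths explicitly, reusing the getBase, startDomain and registerUrl helpers from the snippet above; resolving each raw href through response.urljoin and checking the host manually is an assumption on my part, not necessarily how the original spider works:

# sketch of a parse() method on the asker's spider class
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    base = self.getBase(self.startDomain)  # e.g. "stackoverflow.com"
    extractor = LinkExtractor(allow_domains=base)

    # extract_links() returns absolute URLs, resolved against response.url
    for link in extractor.extract_links(response):
        self.registerUrl(link.url)

    # additionally resolve raw hrefs, so a relative path such as "/meta"
    # becomes "https://stackoverflow.com/meta" before the domain check
    for href in response.css("a::attr(href)").extract():
        absolute = response.urljoin(href)
        # naive suffix check on the host, good enough for a sketch
        if urlparse(absolute).netloc.endswith(base):
            self.registerUrl(absolute)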

Doesn't scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware) do exactly this?

Thanks, apparently I had found some outdated documentation.

Answer


If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware), which

filters out requests for URLs outside the domains covered by the spider.

This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.

My understanding is that URLs are normalized before being filtered.
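To make the moving parts concrete, here is a minimal sketch of a spider that relies on this middleware; the spider name, start URL and parse logic are placeholders for illustration, not the asker's actual code:

import scrapy
from scrapy.linkextractors import LinkExtractor

class InternalLinksSpider(scrapy.Spider):
    name = "internal_links"
    # OffsiteMiddleware drops every request whose host is not in
    # allowed_domains (subdomains of listed domains are still allowed)
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["https://stackoverflow.com/"]

    def parse(self, response):
        # no allow_domains needed on the extractor: offsite requests,
        # including ones built from relative hrefs, are filtered by
        # the middleware before they are downloaded
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)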

Should the OffsiteMiddleware be disabled in settings.py?

Just to be sure that it works.

No; 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500, see https://doc.scrapy.org/en/latest/topics/settings.html?highlight=OffsiteMiddleware
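For reference, this middleware is part of Scrapy's default SPIDER_MIDDLEWARES_BASE, so it normally needs no extra configuration; listing it explicitly in settings.py, as the comment above suggests, would look roughly like this:

# settings.py
SPIDER_MIDDLEWARES = {
    # 500 is the default priority; setting the value to None would
    # disable the middleware instead
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
}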