Scrapy's sitemap crawler: processing links before crawling

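Is it possible to use rules with the sitemap crawler? Some sites have old sitemaps that use http links instead of https. Every time I crawl them, all links get redirected (301), which causes useless traffic on their side (and mine). I think the simplest solution would be to process the links before they are crawled and change the scheme from http to https.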

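Can I do this with rules, or should I use a default middleware and have it basically parse all the URLs? Ignoring redirects might be a solution, but I find it "dirtier".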

Source

2017-06-16 maugch

Scrapy's sitemap crawler (SitemapSpider) has a rules attribute, sitemap_rules.

See: https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.SitemapSpider.sitemap_rules

You can add a regular expression to it that filters out the non-HTTPS URLs.
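As a rough illustration of what such a rule does: each URL found in a sitemap is tested against the rules in order, the first matching regex decides the callback, and URLs that match nothing are dropped before any request is made. The sketch below imitates that matching with the standard library only; RULES, select_callback and parse_page are hypothetical names for illustration, not Scrapy's API.

```python
import re

# Hypothetical rules in the same (regex, callback name) shape as sitemap_rules.
RULES = [(re.compile(r"^https://"), "parse_page")]


def select_callback(loc, rules=RULES):
    """Return the callback name of the first rule whose regex matches loc, else None."""
    for pattern, callback in rules:
        if pattern.search(loc):
            return callback
    # No rule matched: the URL would simply be skipped.
    return None
```

Note that a rule of this kind can only keep or discard a URL. With the rule above, an http link is skipped rather than rewritten to https.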

Source

2017-06-16 11:45:35

Actually, I'm not sure rules are what I need. I think they are applied after the page has been downloaded, which is not what I need. – maugch

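I have a case where the sitemap of the site I crawl contains other sitemaps. I use the "sitemap_follow" regexes to define which sitemap URLs should be followed, combined with the "sitemap_rules" regexes to restrict which links are crawled. The crawler runs as expected: my spider only follows the specified URLs, which contain my target data. –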

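Are you sure sitemap_rules are applied before crawling? I thought you could only discard what you don't want, rather than actually do what I expect. Maybe I should experiment. – maugch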

You could try overriding _parse_sitemap in your SitemapSpider subclass; check the SitemapSpider implementation for details. For example:

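def _parse_sitemap(self, response):
    # Override on MySitemapSpider, a SitemapSpider subclass.
    sitemap_generator = super(MySitemapSpider, self)._parse_sitemap(response)
    if sitemap_generator is None:
        return

    # The parent yields Request objects for nested sitemaps and for pages.
    for request in sitemap_generator:
        # Rewrite the scheme before the request is scheduled, avoiding the redirect.
        if request.url.startswith("http://"):
            request = request.replace(url="https://" + request.url[len("http://"):])
        yield request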
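For the scheme change the question asks about, the URL handling could be factored into a small helper. This is a minimal sketch using only the standard library; force_https is a hypothetical name, and leaving URLs with an explicit port untouched is an assumption made here, since a port such as 80 would usually not serve https.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return url with an http scheme swapped for https; other URLs are unchanged."""
    parts = urlsplit(url)
    # Assumption: an explicit port suggests a non-default setup, so leave it alone.
    if parts.scheme == "http" and parts.port is None:
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

Inside an overridden _parse_sitemap, each yielded request could then be passed through request.replace(url=force_https(request.url)). Since _parse_sitemap is a private method, its behaviour may differ between Scrapy versions.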

Source

2017-06-19 08:30:16
