How do I force Scrapy to scrape the corresponding comments page right after an article page?

I want to scrape news articles and their comments with Scrapy. In my case, a news article and its comments are on different web pages, as in the example below.

(1) Link to an article: http://www.theglobeandmail.com/opinion/editorials/if-britain-leaves-the-eu-will-scotland-leave-britain/article32480429/

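I want my program to understand that (1) and (2) are related. I also want to make sure that (2) is scraped right after (1), with no other pages scraped in between. I use the following rules to scrape the article pages and the comments pages.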

rules = (
     Rule(LinkExtractor(allow = r'\/article\d+\/$'), callback="parse_articles"), 
     Rule(LinkExtractor(allow = r'\/article\d+\/comments\/$'), callback="parse_comments") 
)

I tried issuing an explicit Request from the article parse function, as shown below:

comments_url = response.url + 'comments/' 
print('comments url: ', comments_url) 
return Request(comments_url, callback=self.parse_comments)

But it did not work. How can I make the crawler request the comments page immediately after scraping the article page?


2016-10-22 user7009553

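You need to schedule the request for the comments page manually.
Every article page your crawler finds should contain the URL of its comments page somewhere, right?
In that case, you can simply chain the comments page request inside your parse_article() method.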

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    # name, allowed_domains and start_urls are omitted here

    # only article pages are followed by the rule; comments pages are
    # requested explicitly from parse_article()
    rules = (
        Rule(LinkExtractor(allow=r'/article\d+/$'), callback='parse_article'),
    )
    comments_le = LinkExtractor(allow=r'/article\d+/comments/$')

    def parse_article(self, response):
        item = dict()
        # fill up your item
        ...
        # find comments url on the article page
        comments_links = self.comments_le.extract_links(response)
        if comments_links:
            # yield request and carry over your half-complete item there too
            yield Request(comments_links[0].url, self.parse_comments,
                          meta={'item': item})
        else:
            # no comments link found: emit the article on its own
            yield item

    def parse_comments(self, response):
        # retrieve your half-complete item
        item = response.meta['item']
        # add some things to your item
        ...
        yield item
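
If the comments link cannot be extracted from the article page, one possible fallback is to derive it from the article URL, as the question attempted. The sketch below is standard-library Python only and assumes that the comments page lives at the article URL plus `comments/`, which is taken from the question and not verified against the site. `comments_url_for`, `ARTICLE_RE` and `COMMENTS_RE` are hypothetical names introduced for illustration.

```python
import re
from urllib.parse import urljoin

# Same patterns as the rules above, without the redundant backslashes.
ARTICLE_RE = re.compile(r'/article\d+/$')
COMMENTS_RE = re.compile(r'/article\d+/comments/$')


def comments_url_for(article_url):
    """Return the comments URL for an article URL, or None if it is not one.

    Assumes the comments page is at <article url>/comments/ (from the question).
    """
    # Add a missing trailing slash so urljoin appends to the path
    # instead of replacing its last segment.
    if not article_url.endswith('/'):
        article_url += '/'
    if not ARTICLE_RE.search(article_url):
        return None
    return urljoin(article_url, 'comments/')
```

Inside parse_article() the result could then be passed to Request in place of the extracted link, for example `comments_url_for(response.url)`.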


2016-10-24 08:26:28 Granitosaurus

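Thank you for your reply! It does go to the corresponding comments link, but it still does not scrape the comments page right after the article page. It scrapes other items in between. – user7009553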

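@user7009553 Yes, because Scrapy is asynchronous, it crawls multiple chains in parallel. So it may scrape an article and schedule its comments request while it is scraping some other articles, but the order within your chain is not lost. In this case your chain is parse_article -> parse_comments -> yield item, so you should get the result you expect. – Granitosaurus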
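
The point about chains can be illustrated with a toy model. The sketch below is not Scrapy and does not reproduce its real scheduler (which handles priorities and concurrent downloads). It is a plain FIFO queue, with a hypothetical `crawl` function, that shows page visits interleaving across articles while each item still ends up with its own article and comments.

```python
from collections import deque


def crawl(article_ids):
    """Toy FIFO scheduler illustrating chained callbacks (not Scrapy itself)."""
    queue = deque(("article", a, None) for a in article_ids)
    visited, items = [], []
    while queue:
        kind, article_id, item = queue.popleft()
        visited.append((kind, article_id))
        if kind == "article":
            # Like parse_article: start the item, then schedule the comments
            # page and carry the half-complete item along with the request.
            item = {"article": article_id}
            queue.append(("comments", article_id, item))
        else:
            # Like parse_comments: finish the item that travelled with the request.
            item["comments"] = f"comments for {article_id}"
            items.append(item)
    return visited, items


visited, items = crawl(["a1", "a2"])
print(visited)  # both articles are visited before either comments page
print(items)    # each item still pairs an article with its own comments
```

If the comments page should be fetched sooner, Scrapy's Request also accepts a `priority` argument. Raising it for the comments request may move it ahead of other pending requests, but it is not a guarantee that nothing else is downloaded in between.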
