Scrapy spider that only scrapes each URL once

I am writing a Scrapy spider that crawls a set of URLs every day. However, some of these sites are very large, so I cannot crawl the full site each day, nor would I want to generate the large amount of traffic that would require.
An old question (here) asked something similar. However, the upvoted answer just points to a code snippet (here), which seems to require something of the Request instances, although this is explained neither in the answer nor on the page containing the snippet.
I'd like to work this out, but I find the middleware a bit confusing. A complete example of a scraper that can be run multiple times without re-scraping URLs, whether or not it uses the linked middleware, would be very useful.
I've posted my code below to get the ball rolling, but I am not wedded to this middleware; any Scrapy spider that can run daily and scrape only new URLs would work. Obviously one solution is to write out a dictionary of already-scraped URLs and then check each new URL against it, but that seems very slow/inefficient.
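For what it's worth, the persisted-set idea is cheaper than it sounds: membership checks on a Python set are O(1), and disk is only touched when loading and saving at the edges of a run. A minimal sketch (the file path and helper names are my own, not part of Scrapy):

```python
import json
import os

SEEN_FILE = "seen_urls.json"  # hypothetical path for the persisted set


def load_seen(path=SEEN_FILE):
    """Load previously scraped URLs from disk, or start with an empty set."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()


def save_seen(seen, path=SEEN_FILE):
    """Write the updated URL set back to disk after the crawl finishes."""
    with open(path, "w") as f:
        json.dump(sorted(seen), f)


# Usage: skip any URL already in `seen`, record the rest for next time.
seen = load_seen()
new_urls = ["http://www.cnn.com/a", "http://www.cnn.com/b"]
fresh = [u for u in new_urls if u not in seen]
seen.update(fresh)
save_seen(seen)
```

In a spider this load/save would typically hang off the `spider_opened`/`spider_closed` signals; for very large URL sets, a database would replace the JSON file.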
Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from cnn_scrapy.items import NewspaperItem


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item
Items
import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()
Middleware (ignore.py)
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
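One point the linked snippet leaves unexplained is what it "requires" of the requests: the middleware only acts on requests that carry a `meta['filter_visited']` flag, and it must be registered as a spider middleware. A configuration sketch (the dotted path assumes `ignore.py` lives inside the `cnn_scrapy` package; adjust to your project layout):

```python
# settings.py -- register the spider middleware so Scrapy runs
# process_spider_output() on everything the spider yields.
SPIDER_MIDDLEWARES = {
    'cnn_scrapy.ignore.IgnoreVisitedItems': 543,
}
```

With a CrawlSpider, the flag can be set on every extracted request via the Rule's `process_request` hook (note the hook's signature differs between Scrapy versions: newer releases pass `(request, response)`, older ones just `request`); the hook body would simply do `request.meta['filter_visited'] = True` and return the request.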
And what about the URLs that need to be found in other responses? – eLRuLL
I'm assuming that once I've visited a URL, I won't find any new URLs on that page (apart from the start_urls). Or am I misunderstanding your question? –
No problem, then I think your approach (or something similar) would work. The idea is to persist the URLs that are already done; if there are many of them, I'd recommend a separate database. Scrapy also stores requests as fingerprints, which is what its own deduplication component uses. – eLRuLL
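Building on that last comment: Scrapy's built-in duplicate filter already works on request fingerprints, and it can be persisted between runs without any custom middleware by giving the crawl a job directory (the directory name below is arbitrary):

```shell
# Reusing the same JOBDIR across runs persists the dupefilter's set of
# seen request fingerprints, so URLs crawled on earlier days are skipped.
scrapy crawl newspaper -s JOBDIR=crawls/newspaper
```

One caveat: requests generated from `start_urls` are issued with `dont_filter=True`, so the front page itself is still re-fetched each day; only the discovered links are deduplicated.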