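I am trying to scrape Amazon Grocery UK and fetch the grocery categories, using the Associate Product Advertising API. My requests get queued, but each request expires after 15 minutes. Some requests are only fetched after sitting in the queue for 15 minutes, which means they have already expired by the time they are fetched and produce a 400 error. I am considering enqueuing requests in batches, but even that would fail if it only controls how the batches are processed, because the problem is preparing the requests in batches, not processing them in batches. Unfortunately, Scrapy has very little documentation on this use case, so how can requests be prepared in batches? In short, the URLs expire because of timestamp validation in Scrapy.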
from scrapy.spiders import XMLFeedSpider
from scrapy.utils.misc import arg_to_iter
from scrapy.loader.processors import TakeFirst

from crawlers.http import AmazonApiRequest
from crawlers.items import AmazonCategoryItemLoader
from crawlers.spiders import MySpider


class AmazonCategorySpider(XMLFeedSpider, MySpider):
    name = 'amazon_categories'
    allowed_domains = ['amazon.co.uk', 'ecs.amazonaws.co.uk']
    marketplace_domain_name = 'amazon.co.uk'
    download_delay = 1
    rotate_user_agent = 1
    grocery_node_id = 344155031

    # XMLSpider attributes
    iterator = 'xml'
    itertag = 'BrowseNodes/BrowseNode/Children/BrowseNode'

    def start_requests(self):
        return arg_to_iter(
            AmazonApiRequest(
                qargs=dict(Operation='BrowseNodeLookup',
                           BrowseNodeId=self.grocery_node_id),
                meta=dict(ancestor_node_id=self.grocery_node_id)
            ))

    def parse(self, response):
        response.selector.remove_namespaces()
        has_children = bool(response.xpath('//BrowseNodes/BrowseNode/Children'))
        if not has_children:
            return response.meta['category']
        # here the request should be configurable to allow batching
        return super(AmazonCategorySpider, self).parse(response)

    def parse_node(self, response, node):
        category = response.meta.get('category')
        l = AmazonCategoryItemLoader(selector=node)
        l.add_xpath('name', 'Name/text()')
        l.add_value('parent', category)
        node_id = l.get_xpath('BrowseNodeId/text()', TakeFirst(), lambda x: int(x))
        l.add_value('node_id', node_id)
        category_item = l.load_item()
        return AmazonApiRequest(
            qargs=dict(Operation='BrowseNodeLookup',
                       BrowseNodeId=node_id),
            meta=dict(ancestor_node_id=node_id,
                      category=category_item)
        )
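Can you post some of the spider code? Usually people just batch requests with the `spider_idle` signal: when the spider goes idle, pop a batch and schedule some requests. See my related answer: http://stackoverflow.com/questions/43532976/scrapy-limit-on-start-url/43537446?s=2%7C0.1085#43537446 – Granitosaurus
I have updated the comment with reference code @Granitosaurus – pranavsharma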
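A minimal sketch of the batching idea from the comments, assuming the expiry comes from a timestamp that is fixed when the request object is created. Instead of queuing finished requests, it queues only the cheap parameters and builds the timestamped request when a batch is popped. `DeferredRequestQueue`, `build_request`, the `timestamp` field, and the node IDs are all hypothetical names for illustration; `build_request` is a plain-dict stand-in for `AmazonApiRequest`, whose real signing logic is not shown in the question.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List


def build_request(params: Dict, clock: Callable[[], float] = time.time) -> Dict:
    """Hypothetical stand-in for AmazonApiRequest.

    The timestamp is read when this function runs, i.e. at the moment
    the request is actually built, not when the node was discovered.
    """
    return {
        "qargs": {"Operation": "BrowseNodeLookup",
                  "BrowseNodeId": params["node_id"]},
        "meta": {"ancestor_node_id": params["node_id"],
                 "category": params.get("category")},
        "timestamp": clock(),
    }


class DeferredRequestQueue:
    """Stores request parameters and builds requests one batch at a time."""

    def __init__(self, builder: Callable[[Dict], Dict], batch_size: int = 10):
        self._pending: Deque[Dict] = deque()
        self._builder = builder
        self.batch_size = batch_size

    def __len__(self) -> int:
        return len(self._pending)

    def push(self, **params) -> None:
        # Only the parameters are stored, so nothing can expire while waiting.
        self._pending.append(params)

    def pop_batch(self) -> List[Dict]:
        # Build (and therefore timestamp) at most batch_size requests now.
        batch = []
        while self._pending and len(batch) < self.batch_size:
            batch.append(self._builder(self._pending.popleft()))
        return batch


# Usage: three discovered nodes, released two at a time.
queue = DeferredRequestQueue(build_request, batch_size=2)
for node_id in (101, 102, 103):  # hypothetical node IDs
    queue.push(node_id=node_id)
first_batch = queue.pop_batch()  # two requests, stamped at pop time
```

In a spider, `parse_node` would push parameters onto such a queue instead of returning an `AmazonApiRequest`. A handler connected to `scrapy.signals.spider_idle` (via `crawler.signals.connect` in `from_crawler`) could then pop a batch, hand each request to `crawler.engine.crawl` (its argument list differs between Scrapy versions), and raise `scrapy.exceptions.DontCloseSpider` while the queue is not empty. That wiring is left out here because it depends on the Scrapy version and on how `AmazonApiRequest` signs its URLs.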