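Passing a class between functions

I've built a simple(ish) parser in Scrapy, and I'm blissfully ignorant when it comes to Scrapy and Python :-) In the file item.py I have defined thisItem(), which I assign to item in the code below. It all works rather swimmingly: parse uses a callback to get to parse_dir_contents ... but then I realised I needed to scrape an extra bit of data and created another function, parse_other_content. How do I get what's already in item into parse_other_content?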

import scrapy 
from this-site.items import * 
import re 
import json 

class DmozSpider(scrapy.Spider):
    name = "ABB"
    allowed_domains = ["this-site.com.au"]
    start_urls = [
        "https://www.this-site.com.au?page=1",
        "https://www.this-site.com.au?page=2",
    ]

    def parse(self, response):
        for href in response.xpath('//h3/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//h1[@itemprop="name"]'):
            item = thisItem()
            item['title'] = sel.xpath('text()').extract()
            item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
            so = re.search(r'\d+', response.url)
            propID = so.group()
            item['propid'] = propID
            item['link'] = response.url
            yield scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID, callback=self.parse_other_content)
            #yield item

    def parse_other_content(self, reponse):
        sel = json.loads(reponse.body)
        item['rate_detail'] = sel["this"][0]["that"]
        yield item

I know I'm missing something simple, but I can't seem to figure it out.

Source

2016-03-01 Jeroen

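The question is unclear. Do you just want to send 'item' to another method, e.g. as a function argument, or make it a variable visible to the whole 'DmozSpider' class?

Method 1 has my preference; method 2 would also work, I guess. – Jeroen

Per the Scrapy documentation (http://doc.scrapy.org/en/1.0/topics/request-response.html#topics-request-response-ref-request-callback-arguments):

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.

In your case, I would do something like this: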

def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        item = thisItem()
        ...
        request = scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID, callback=self.parse_other_content)
        request.meta['item'] = item
        yield request

def parse_other_content(self, response):
    item = response.meta['item']
    # do something with the item
    return item

According to Steve (see the comments), you can also pass the meta dictionary to the Request constructor as a keyword argument, like this:

def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        item = thisItem()
        ...
        request = scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID, callback=self.parse_other_content, meta={'item': item})
        yield request
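
To make the hand-off less magical, here is a toy model of what happens between the two callbacks. It is only a sketch: ToyRequest, ToyResponse and run are hypothetical stand-ins written for this illustration, not Scrapy classes, and the URLs and bodies are made up. The one thing it borrows from Scrapy is the documented behaviour that response.meta is a shortcut to the meta dict of the request that produced the response.

```python
# Toy model of the meta hand-off. These classes are NOT Scrapy's own;
# they only imitate the one behaviour that matters here.

class ToyRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = dict(meta or {})


class ToyResponse:
    def __init__(self, request, body):
        self.url = request.url
        self.body = body
        # The response exposes the meta dict of the request that produced it.
        self.meta = request.meta


def run(start_request, fake_bodies):
    """Drain a queue of requests, feeding each callback a canned response."""
    pending = [start_request]
    items = []
    while pending:
        request = pending.pop(0)
        response = ToyResponse(request, fake_bodies[request.url])
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                pending.append(result)   # follow-up request: schedule it
            else:
                items.append(result)     # anything else counts as a finished item
    return items


def parse_dir_contents(response):
    item = {"title": response.body, "link": response.url}
    # Attach the half-filled item to the follow-up request.
    yield ToyRequest("detail", callback=parse_other_content, meta={"item": item})


def parse_other_content(response):
    item = response.meta["item"]         # the same dict, carried along by the request
    item["rate_detail"] = response.body
    yield item


items = run(
    ToyRequest("listing", callback=parse_dir_contents),
    {"listing": "Some title", "detail": "42 per night"},
)
print(items)
```

Because the item travels with its own request, each response arrives with exactly the item that was built for it, no matter in which order the responses come back.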

Source

2016-03-01 02:23:49 ngoue

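By what mechanism is the content already in 'item' made available to 'parse_other_content'? This answer seems incomplete... but like I said, I'm ignorant when it comes to Python, so feel free to enlighten me. – Jeroen

The mechanism is built into Scrapy. The authors let you store data in the 'request.meta' dictionary, and the library is built to pass that data along to the next callback. – ngoue

Cool, but shouldn't it be passed explicitly here: 'request = scrapy.Request("https://www.this-site.com.au/something?listing_id=" + propID, callback=self.parse_other_content)', maybe as a third argument? – Jeroen

You can make item visible to parse_other_content() by changing it to self.item, or you can send it as a function argument. (The first is probably easier.)

For the first solution, just add self. to every reference to the item variable. That makes it visible to the whole class.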

def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        self.item = thisItem()
        self.item['title'] = sel.xpath('text()').extract()
        self.item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
        so = re.search(r'\d+', response.url)
        propID = so.group()
        self.item['propid'] = propID
        self.item['link'] = response.url
        yield scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID, callback=self.parse_other_content)
        #yield item

def parse_other_content(self, response):
    sel = json.loads(response.body)
    self.item['rate_detail'] = sel["this"][0]["that"]
    yield self.item

Source

2016-03-01 02:26:57

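I wouldn't recommend this. Scrapy has the ability to pass items to callbacks. See my answer. – ngoue

@mcjoejoe0911 is right, don't do this. Scrapy is asynchronous, and there is no guarantee that the data in self.item relates to the data being parsed in the second method. You must pass it via 'response.meta'. – Steve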
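
A small simulation may help show what can go wrong with self.item. This is a sketch under stated assumptions, not Scrapy code: SharedStateSpider and MetaSpider are hypothetical classes, the listing IDs and detail strings are invented, and the "asynchrony" is imitated by simply running the first callback for both pages before either second callback runs, which is one ordering an asynchronous crawler is free to produce.

```python
class SharedStateSpider:
    """Hypothetical spider that keeps the current item on self (the risky way)."""

    def parse_dir_contents(self, listing_id):
        self.item = {"propid": listing_id}   # overwrites whatever was stored before
        return listing_id                    # stands in for the follow-up request

    def parse_other_content(self, listing_id, detail):
        self.item["rate_detail"] = detail
        return dict(self.item)


class MetaSpider:
    """Hypothetical spider that sends the item along with the request instead."""

    def parse_dir_contents(self, listing_id):
        item = {"propid": listing_id}
        return {"item": item}                # stands in for request.meta

    def parse_other_content(self, meta, detail):
        item = meta["item"]
        item["rate_detail"] = detail
        return item


details = {"101": "detail for 101", "202": "detail for 202"}

# Both listing pages are parsed before either detail response comes back.
shared = SharedStateSpider()
pending = [shared.parse_dir_contents(i) for i in ("101", "202")]
shared_items = [shared.parse_other_content(i, details[i]) for i in pending]

meta_spider = MetaSpider()
metas = [(i, meta_spider.parse_dir_contents(i)) for i in ("101", "202")]
meta_items = [meta_spider.parse_other_content(m, details[i]) for i, m in metas]

print(shared_items)  # listing 101's detail ends up on the item for 202
print(meta_items)    # each detail stays with its own listing
```

In the shared-state version the second listing overwrites self.item before the first detail response is handled, so the first result pairs propid 202 with the detail for 101. The meta version keeps every item attached to its own request.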
