In Scrapy, how do I pass the URLs generated in one class to the next class in the script?

Here is my spider's code:

import scrapy


class ProductMainPageSpider(scrapy.Spider):
    name = 'ProductMainPageSpider'
    start_urls = ['http://domain.com/main-product-page']

    def parse(self, response):
        for product in response.css('article.isotopeItem'):
            yield {
                'title': product.css('h3 a::text').extract_first().encode("utf-8"),
                'category': product.css('h6 a::text').extract_first(),
                'img': product.css('figure a img::attr("src")').extract_first(),
                'url': product.css('h3 a::attr("href")').extract_first()
            }


class ProductSecondaryPageSpider(scrapy.Spider):
    name = 'ProductSecondaryPageSpider'
    start_urls = """ URLS IN product['url'] FROM PREVIOUS CLASS """

    def parse(self, response):
        for product in response.css('article.isotopeItem'):
            yield {
                'title': product.css('h3 a::text').extract_first().encode("utf-8"),
                'thumbnail': product.css('figure a img::attr("src")').extract_first(),
                'short_description': product.css('div.summary').extract_first(),
                'description': product.css('div.description').extract_first(),
                'gallery_images': product.css('figure a img.gallery-item ::attr("src")').extract_first()
            }

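The first class works fine if I remove the second class. It correctly generates my JSON file with the requested items in it. However, the website I need to crawl has two levels. It has a product archive page that shows each product as a thumbnail, title and category (and this information is not on the next page). Then, if you click one of the thumbnails or titles, you are sent to a single product page with specific information about that product.

There are a lot of products, so I would like to pipe (yield?) the URLs in product['url'] to the second class as its "start_urls" list. But I simply don't know how to do that. My knowledge doesn't go far enough for me to even know what I'm missing or what is wrong, so that I could find a solution myself.

See line 20 of the code for what I am trying to do.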


2016-12-01 Adriano C R

You don't have to create two spiders for this. You can simply go on to the next URL and carry your item over, i.e.:

from scrapy import Request

# `MyItem` is a placeholder for your own scrapy.Item subclass (a plain dict also works).
# Both methods belong inside your spider class.

def parse(self, response):
    item = MyItem()
    item['name'] = response.xpath("//name/text()").extract()
    next_page_url = response.xpath("//a[@class='next']/@href").extract_first()
    yield Request(
        response.urljoin(next_page_url),  # make a relative href absolute
        self.parse_next,
        meta={'item': item},  # carry over our item
    )

def parse_next(self, response):
    # get our carried item from response meta
    item = response.meta['item']
    item['description'] = response.xpath("//description/text()").extract()
    yield item

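However, if for some reason you really want to split the logic of these two steps, you can simply save the results of the first spider to a file (JSON, for example: scrapy crawl first_spider -o results.json) and then open and iterate over that file in your second spider's start_requests() method, which would yield the URLs, i.e.: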

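import json
from scrapy import Request, Spider

class MySecondSpider(Spider):
    name = 'second_spider'  # placeholder name; every spider needs one

    def start_requests(self):
        # this overrides `start_urls` logic
        with open('results.json', 'r') as f:
            data = json.load(f)
        for item in data:
            # with no callback given, the response goes to self.parse
            yield Request(item['url'])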
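
One caveat with the file-based approach: the first spider stores whatever `href` it found, which may be a relative link, a duplicate, or `null` when the selector matched nothing, and `Request` only accepts absolute URLs. A minimal sketch of a helper that cleans the feed first, using only the standard library; the function name `load_start_urls` and its `base_url` parameter are hypothetical, not part of Scrapy:

```python
import json
from urllib.parse import urljoin


def load_start_urls(feed_path, base_url):
    """Return the unique, absolute product URLs stored in a JSON feed file."""
    with open(feed_path, encoding="utf-8") as f:
        items = json.load(f)

    urls = []
    seen = set()
    for item in items:
        href = item.get("url")
        if not href:
            # extract_first() stores None when nothing matched; skip those.
            continue
        url = urljoin(base_url, href)  # resolve relative links
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Inside `start_requests()` one could then loop over `load_start_urls('results.json', 'http://domain.com/main-product-page')` and yield a `Request` for each URL.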


2016-12-01 06:11:22 Granitosaurus

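I don't get where the "MyItem()" part comes from. What is that? Actually, your answer left me more confused than I was before.

@AdrianoBatista dude, 'MyItem' is just a placeholder name for whatever your 'scrapy.Item' class is. You can of course replace it with a simple dictionary, but usually you want to stick with 'scrapy.Item'. If you are confused by this answer, you should go through the tutorial to get familiar with the basic concepts of spiders, items and requests: https://doc.scrapy.org/en/latest/intro/tutorial.html - Granitosaurus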
