Scrapy: crawl a list of URLs extracted by a Rule, iterating over a parameter

I am new to Scrapy and Python. I want to do the following:

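1. Visit a URL and get every link that contains "shop/products" as part of its URL. The links look like this: "http://www.example.com/shop/products/category-name"

2. Scrape each URL in start_urls and get the total number of products, TOTAL. In the code, TOTAL = num_items_per_category.

3. Append "?sort=Top&size=12&start=PARAM" to the URL. PARAM must increase by 12 on each iteration for as long as PARAM <= TOTAL. The final URL will be "http://www.example.com/shop/products/category-name?sort=Top&size=12&start=PARAM"

4. Take the next URL from the generated start_urls and start again from step 2.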
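Steps 2 and 3 can be sketched outside of Scrapy to check the URL arithmetic. This is only an illustration using the standard library: the helper name `page_urls` is hypothetical, and it assumes paging starts at 0 (as in the spider code) and continues while start <= TOTAL, as stated in step 3.

```python
from urllib.parse import urlencode

PAGE_SIZE = 12  # taken from "size=12" in the question


def page_urls(category_url, total, page_size=PAGE_SIZE):
    """Yield one listing URL per page for as long as start <= total."""
    for start in range(0, total + 1, page_size):
        # Build the query string from scratch for every page.
        query = urlencode({"sort": "Top", "size": page_size, "start": start})
        yield f"{category_url}?{query}"


# Hypothetical category with 30 products: pages start at 0, 12 and 24.
urls = list(page_urls("http://www.example.com/shop/products/category-name", 30))
```

In a spider, each of these URLs would become one request for the category, and step 4 repeats the same loop for the next category link.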

Here is my spider code:

    import scrapy
    import re
    import datetime

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
    from scrapy.http.request import Request


    class MySpider(CrawlSpider):
        name = 'my_spider'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/shop/products']
        rules = (
            Rule(LxmlLinkExtractor(
                restrict_xpaths=('.//li[@class="item"]/a')),
                follow=False,
                callback='parse_list'
            ),
        )

        def parse_list(self, response):
            ITEM_SELECTOR = '.product'
            for item in response.css(ITEM_SELECTOR):
                NAME_SELECTOR = 'div[@class="product"]/h2/a/@title'
                yield {
                    'name': item.xpath(NAME_SELECTOR).extract_first()
                }

            NUM_ITEMS_PER_CATEGORY_SELECTOR = 'div[@id="search"]/@data-count'
            num_items_per_category = item.xpath(NUM_ITEMS_PER_CATEGORY_SELECTOR).extract_first()
            nipc = int(0 if num_items_per_category is None else num_items_per_category)

            try:
                next_start = response.meta["next_start"]
            except KeyError:
                next_start = 0

            if next_start <= nipc:
                yield scrapy.Request(
                    response.urljoin('%s?sort=Top&size=12&start=%s' % (response.url, next_start)),
                    meta={"next_start": next_start + 12},
                    dont_filter=True,
                    callback=self.parse_list
                )

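The problems are:

I don't know whether there is a CSS selector or regular expression I can use in the Rule to select every link I want. In the code I am restricting extraction to a path where I know some of the links I want are, but there are more of them elsewhere on the page.

The code does not work as I expect. It seems that next_start is not incremented by 12 on each iteration: the code only gets the first 12 items for each URL in the generated start_urls list. Am I using the meta variable correctly? Or do I need a separate first scrape of each category page to get the total count before I can use it to iterate? Or maybe I need a different approach using start_requests... What do you think?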

2016-12-29 ArtStack

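What your spider actually does is visit the URL http://www.example.com/shop/products, extract all links inside <li class="item"> elements, and fetch every one of them with the parse_list callback. As far as I can see, this is not the behavior you are expecting. Instead, put a starting URL that contains the seed links in start_urls, and use an extractor with allow=r"shop/products" in the rule.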
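To illustrate what allow=r"shop/products" selects, here is a small standard-library sketch of the same filtering idea. It is not Scrapy's own code: the `allowed_links` helper and the sample links are hypothetical, and it assumes the pattern is matched anywhere in the link's absolute URL (a regex search rather than a full match).

```python
import re

# Same pattern as the allow=r"shop/products" argument suggested above.
ALLOW = re.compile(r"shop/products")


def allowed_links(links, pattern=ALLOW):
    """Keep only the links whose URL contains a match for the pattern."""
    return [link for link in links if pattern.search(link)]


# Hypothetical links found on the start page.
links = [
    "http://www.example.com/shop/products/category-name",
    "http://www.example.com/about",
    "http://www.example.com/shop/cart",
]
category_links = allowed_links(links)
```

Because the pattern also matches the listing root http://www.example.com/shop/products itself, a tighter expression such as r"shop/products/[^/?]+" may be a better fit if only category pages are wanted.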

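Also, the part '%s?sort=Top&size=12&start=%s' % (response.url, next_start) is wrong, because response.url contains the full URL including its GET parameters. Each time, you append a new parameter string to the existing one, producing something like ?sort=Top&size=12&start=0?sort=Top&size=12&start=12?sort=Top&size=12&start=24. Strip the parameters from the URL before appending the new string, or use FormRequest as a more convenient way to pass parameters.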
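One way to strip the old parameters before adding new ones is to rebuild the URL with the standard library. The sketch below is a minimal illustration, not the only fix: `with_start` is a hypothetical helper, and its defaults (size=12, sort="Top") are taken from the question.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def with_start(url, start, size=12, sort="Top"):
    """Return url with its query string replaced by fresh paging parameters."""
    parts = urlsplit(url)
    query = urlencode({"sort": sort, "size": size, "start": start})
    # Discard the old query and fragment instead of appending to them.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


base = "http://www.example.com/shop/products/category-name"
first = with_start(base, 0)
second = with_start(first, 12)  # `first` already carries a query string
```

Inside parse_list, the request URL could then be built as with_start(response.url, next_start), so the query string is replaced on every page rather than growing.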

By the way, Scrapy has a very handy debugging console, which you can invoke from any part of your spider with scrapy.shell.inspect_response.

2016-12-30 15:33:24 mizhgun

Thanks, mizhgun! That is exactly what was happening! I will use the debugger to test this kind of thing. – ArtStack
