2017-09-21

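Scrapy: detect whether an XPath exists

I have been trying to build my first scraper and have already got what I need (the shipping information and prices for the 1st shop and the 2nd shop), but it takes 2 scrapers instead of 1, which is a big bottleneck.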

When there is more than 1 shop, the output is:

In [1]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()').extract() 
Out[1]: 
[u'ENV\xcdO 3,95\u20ac ', 
u'ENV\xcdO GRATIS', 
u'ENV\xcdO GRATIS', 
u'ENV\xcdO 4,95\u20ac '] 

To get only the second result I use:

In [2]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')[1].extract() 
Out[2]: u'ENV\xcdO GRATIS' 

But when there is no second result (only 1 shop) I get:

IndexError: list index out of range 

And the crawler skips the whole page, even though the other fields have data...
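A minimal sketch of why the whole page is lost: the callback is a generator, and an uncaught IndexError raised before the `yield` means no item is produced for that page at all. The function `parse_like_callback` and the plain lists below are hypothetical stand-ins for a Scrapy callback and the extracted results, not real Scrapy objects.

```python
def parse_like_callback(shipping_texts):
    """Hypothetical stand-in for a Scrapy callback; takes the extracted texts as a list."""
    item = {"first_shipping": shipping_texts[0]}
    # With only one shop this line raises IndexError, so the yield is never reached.
    item["second_shipping"] = shipping_texts[1]
    yield item


# Two shops listed: the item is produced as expected.
two_shops = list(parse_like_callback(["ENV\xcdO 3,95\u20ac ", "ENV\xcdO GRATIS"]))

# One shop listed: the exception discards the item, including the fields that had data.
try:
    one_shop = list(parse_like_callback(["ENV\xcdO GRATIS"]))
except IndexError:
    one_shop = []
```

So the fields that did have data are thrown away together with the missing one, which matches the "skipped page" behaviour described above.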

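After several attempts I settled on a quick workaround to get the results: 2 crawlers, one for the first shop and one for the second. Now I would like to do it cleanly with just 1 crawler.

Any help, hints or advice would be greatly appreciated. This is my first attempt at a recursive crawler with Scrapy, and I am starting to like it.

Here is the code. The crawler works as written when there is more than 1 shop: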

# -*- coding: utf-8 -*-
import scrapy
from Guapalia.items import GuapaliaItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class GuapaliaSpider(CrawlSpider):
    name = "guapalia"
    allowed_domains = ["guapalia.com"]
    start_urls = (
        'https://www.guapalia.com/perfumes?page=1',
        'https://www.guapalia.com/maquillaje?page=1',
        'https://www.guapalia.com/cosmetica?page=1',
        'https://www.guapalia.com/linea-de-bano?page=1',
        'https://www.guapalia.com/parafarmacia?page=1',
        'https://www.guapalia.com/solares?page=1',
        'https://www.guapalia.com/regalos?page=1',
    )
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='js-pager']/a[contains(text(),'Siguientes')]"), follow=True),
        Rule(LinkExtractor(restrict_xpaths="//div[@class='list-display__item list-display__item--product']/div/a[@class='col-xs-10 col-sm-10 col-md-12 clickOnProduct']"), callback='parse_articles', follow=True),
    )

    def parse_articles(self, response):
        item = GuapaliaItem()
        articles_urls = response.url
        articles_first_shop = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="retailer-logo autoimage-container"]/img/@title').extract()
        articles_first_shipping = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="shipping"]/p//text()').extract()
        articles_second_shop = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div/img/@title')[1].extract()
        articles_second_shipping = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')[1].extract()
        articles_name = response.xpath('//div[@id="ProductDetail"]/@data-description').extract()
        item['articles_urls'] = articles_urls
        item['articles_first_shop'] = articles_first_shop
        item['articles_first_shipping'] = articles_first_shipping
        item['articles_second_shop'] = articles_second_shop if articles_second_shop else 'N/A'
        item['articles_second_shipping'] = articles_second_shipping
        item['articles_name'] = articles_name
        yield item

Basic output:

2017-09-21 09:53:11 [scrapy] DEBUG: Crawled (200) <GET https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355> (referer: https://www.guapalia.com/perfumes?page=1) 
2017-09-21 09:53:11 [scrapy] DEBUG: Scraped from <200 https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355> 
{'articles_first_shipping': [u'ENV\xcdO GRATIS'], 
'articles_first_shop': [u'DOUGLAS'], 
'articles_name': [u'ZEN edp vaporizador 100 ml'], 
'articles_second_shipping': u'ENV\xcdO 3,99\u20ac ', 
'articles_second_shop': u'BUYSVIP', 
'articles_urls': 'https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355'} 

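The problem appears when there is no second shop, because my code for the second-shop fields indexes a result that is not there:

IndexError: list index out of range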

Solution, thanks to @Tarun Lalwani:

def parse_articles(self, response):
    item = GuapaliaItem()
    articles_urls = response.url
    articles_first_shop = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="retailer-logo autoimage-container"]/img/@title').extract()
    articles_first_shipping = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="shipping"]/p//text()').extract()
    articles_second_shop = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div/img/@title')
    articles_second_shipping = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')
    articles_name = response.xpath('//div[@id="ProductDetail"]/@data-description').extract()
    if len(articles_second_shop) > 1:
        item['articles_second_shop'] = articles_second_shop[1].extract()
    else:
        item['articles_second_shop'] = 'Not Found'
    if len(articles_second_shipping) > 1:
        item['articles_second_shipping'] = articles_second_shipping[1].extract()
    else:
        item['articles_second_shipping'] = 'Not Found'
    item['articles_urls'] = articles_urls
    item['articles_first_shop'] = articles_first_shop
    item['articles_first_shipping'] = articles_first_shipping
    item['articles_name'] = articles_name
    yield item
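The two if/else blocks above follow the same pattern, so they could be folded into a loop. The sketch below is only one possible way to do that, assuming the extracted values arrive as plain lists (as `.extract()` returns them); the helper name `fill_second_entries` and the sample data are made up for illustration and are not part of Scrapy.

```python
def fill_second_entries(item, extracted, default="Not Found"):
    """Copy the second value of each extracted list into item, or default if absent."""
    for field, values in extracted.items():
        # A slice never raises: values[1:2] is simply empty when there is no second entry.
        second = values[1:2]
        item[field] = second[0] if second else default
    return item


# Page with a single shop: both second-shop fields fall back to the default.
one_shop = fill_second_entries({}, {
    "articles_second_shop": ["DOUGLAS"],
    "articles_second_shipping": ["ENV\xcdO GRATIS"],
})

# Page with two shops: the second entry of each list is used.
two_shops = fill_second_entries({}, {
    "articles_second_shop": ["DOUGLAS", "BUYSVIP"],
    "articles_second_shipping": ["ENV\xcdO GRATIS", "ENV\xcdO 3,99\u20ac "],
})
```

Slicing is used instead of a length check because a slice past the end of a list returns an empty list rather than raising.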

Answer

You need to store the result in a variable first. Then you can decide what to do based on its length:

texts = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')

if len(texts) > 1:
    # Second shop present: take its shipping text.
    data = texts[1].extract()
elif len(texts) == 1:
    # Only one shop: fall back to the first result.
    data = texts[0].extract()
else:
    data = "Not found"
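If the same length check is needed for several fields, it could be wrapped in a small helper. This is a rough sketch, not a Scrapy API: `first_available` is a hypothetical name, and it is shown on plain lists of already extracted strings rather than on a selector list.

```python
def first_available(results, preferred_indexes=(1, 0), default="Not found"):
    """Return the first existing entry among preferred_indexes, else default."""
    for index in preferred_indexes:
        # Only non-negative positions inside the list count as "existing".
        if 0 <= index < len(results):
            return results[index]
    return default


with_two = first_available(["ENV\xcdO 3,95\u20ac ", "ENV\xcdO GRATIS"])
with_one = first_available(["ENV\xcdO GRATIS"])
with_none = first_available([])
```

Passing `preferred_indexes=(1,)` would skip the fallback to the first result and return the default whenever there is no second entry.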

Thank you very much, it works perfectly! (A very logical answer; time for coffee.)