
Scrapy SitemapSpider not working: I am trying to scrape the site of a well-known UK retailer and get the following AttributeError:

nl_env/lib/python3.6/site-packages/scrapy/spiders/sitemap.py", line 52, in _parse_sitemap
    for r, c in self._cbs:

AttributeError: 'NlSMCrawlerSpider' object has no attribute '_cbs'

It may well be that I have not fully grasped how the SitemapSpider is supposed to work - see my code below:

class NlSMCrawlerSpider(SitemapSpider):
    name = 'nl_smcrawler'
    allowed_domains = ['newlook.com']
    sitemap_urls = ['http://www.newlook.com/uk/sitemap/maps/sitemap_uk_product_en_1.xml']
    sitemap_follow = ['/uk/womens/clothing/']

    # sitemap_rules = [
    #     ('/uk/womens/clothing/', 'parse_product'),
    # ]

    def __init__(self):
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(2)

    def parse_product(self, response):
        driver = self.driver
        driver.get(response.url)
        time.sleep(1)

        # Collect products
        itemDetails = driver.find_elements_by_class_name('product-details-page content')

        # Pull features
        desc = itemDetails[0].find_element_by_class_name('product-description__name').text
        href = driver.current_url

        # Generate a product identifier
        identifier = href.split('/p/')[1].split('?comp')[0]
        identifier = int(identifier)

        # datetime
        dt = date.today()
        dt = dt.isoformat()

        # Price Symbol removal and integer conversion
        try:
            priceString = itemDetails[0].find_element_by_class_name('price product-description__price').text
        except:
            priceString = itemDetails[0].find_element_by_class_name('price--previous-price product-description__price--previous-price ng-scope').text
        priceInt = priceString.split('£')[1]
        originalPrice = float(priceInt)

        # discountedPrice Logic
        try:
            discountedPriceString = itemDetails[0].find_element_by_class_name('price price--marked-down product-description__price').text
            discountedPriceInt = discountedPriceString.split('£')[1]
            discountedPrice = float(discountedPriceInt)
        except:
            discountedPrice = 'N/A'

        # NlScrapeItem
        item = NlScrapeItem()

        # Append product to NlScrapeItem
        item['identifier'] = identifier
        item['href'] = href
        item['description'] = desc
        item['originalPrice'] = originalPrice
        item['discountedPrice'] = discountedPrice
        item['firstSighted'] = dt
        item['lastSighted'] = dt

        yield item

Also, do not hesitate to ask for any further details; see the link to the sitemap spider inside the Scrapy package that throws the error (link - github) and the link to the actual sitemap file. Any help would be sincerely appreciated.

Edit: one thought. Looking at the 2nd link (from the Scrapy package), I can see that _cbs is initialised in the def __init__(self, *a, **kw): function. Is the fact that I have my own __init__ logic throwing it off?
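For reference, the relevant part of Scrapy's SitemapSpider.__init__ looks roughly like the sketch below. This is paraphrased from the source rather than copied verbatim, so details may differ between Scrapy versions, but the key point is that _cbs is only ever created there:

# Rough paraphrase of scrapy/spiders/sitemap.py - not a verbatim copy,
# exact details vary between Scrapy versions.
class SitemapSpider(Spider):
    sitemap_urls = ()
    sitemap_rules = [('', 'parse')]
    sitemap_follow = ['']

    def __init__(self, *a, **kw):
        super(SitemapSpider, self).__init__(*a, **kw)
        self._cbs = []  # the attribute the traceback complains about
        for r, c in self.sitemap_rules:
            if isinstance(c, str):
                c = getattr(self, c)  # resolve callback names to bound methods
            # regex() is a small helper in the same module that compiles string patterns
            self._cbs.append((regex(r), c))
        self._follow = [regex(x) for x in self.sitemap_follow]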

Answer


There are two issues in your scraper. The first is the __init__ method: you have defined a new __init__ that overrides the base class's __init__:

def __init__(self): 
    self.driver = webdriver.Safari() 
    self.driver.set_window_size(800, 600) 
    time.sleep(2) 

This never calls the base class's __init__, so _cbs never gets initialised. You can easily fix it by changing your __init__ as follows:

def __init__(self, *a, **kw): 
    super(NlSMCrawlerSpider, self).__init__(*a, **kw) 

    self.driver = webdriver.Safari() 
    self.driver.set_window_size(800, 600) 
    time.sleep(2) 

Next, the SitemapSpider always sends responses to the parse method (since your sitemap_rules are commented out, the default rule routes everything there), and you have not defined a parse method at all. So I added a simple one that just prints the URL:

def parse(self, response): 
    print(response.url) 
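
If the goal is to have parse_product handle the product pages rather than just printing URLs, one option (a sketch of the spider's head only, untested against the live site) is to uncomment the sitemap_rules so that matching URLs are routed to that callback instead of the default parse:

class NlSMCrawlerSpider(SitemapSpider):
    name = 'nl_smcrawler'
    allowed_domains = ['newlook.com']
    sitemap_urls = ['http://www.newlook.com/uk/sitemap/maps/sitemap_uk_product_en_1.xml']
    sitemap_follow = ['/uk/womens/clothing/']

    # URLs matching this pattern go to parse_product; anything else
    # falls back to the default 'parse' callback.
    sitemap_rules = [
        ('/uk/womens/clothing/', 'parse_product'),
    ]

    def __init__(self, *a, **kw):
        super(NlSMCrawlerSpider, self).__init__(*a, **kw)  # keeps _cbs initialised
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(2)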

Thank you - that worked! Awesome! – Philipp