
Scrapy: scraping URLs

The second-level parse function in the code below should execute about 32 times (the for loop finds 32 hrefs), once for each sub-link, to scrape data from each of those pages (32 individual URLs handed to the parse_next function). But parse_next executes only once (a single pass) / is never called, and the output CSV file is empty. Can anyone help me see where I went wrong with the scraping?

import scrapy 
import logging 

logger = logging.getLogger('mycustomlogger') 

from ScrapyTestProject.items import ScrapytestprojectItem 
class QuotesSpider(scrapy.Spider): 
    name = "nestedurl" 
    allowed_domains = ['www.grohe.in'] 
    start_urls = [ 
     'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/', 
def parse(self, response): 
    logger.info("Parse function called on %s", response.url) 
    for divs in response.css('div.viewport div.workspace div.float-box'): 
     item = {'producturl': divs.css('a::attr(href)').extract_first(), 
       'imageurl': divs.css('a img::attr(src)').extract_first(), 
       'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()} 
     next_page = response.urljoin(item['producturl']) 
     #logger.info("This is an information %s", next_page) 
     yield scrapy.Request(next_page, callback=self.parse_next, meta={'item': item}) 
     #yield item 

def parse_next(self, response): 
    item = response.meta['item'] 
    logger.info("Parse function called on2 %s", response.url) 
    item['headline'] = response.css('div#content a.headline::text').extract() 
    return item 
    #response.css('div#product-variants a::attr(href)').extract() 

Check your loop; it should work, so there is probably some kind of error in the log. Have you tried running the spider with the DEBUG log level? That should give you some indication of where things go wrong. – Casper
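
For reference, a minimal sketch of how to raise the log level in Scrapy (the spider name nestedurl comes from the code above; everything else is standard Scrapy configuration):

 # settings.py -- raise log verbosity for the whole project
 LOG_LEVEL = 'DEBUG'

The same effect can be had per run with scrapy crawl nestedurl -L DEBUG, and scrapy crawl nestedurl -o items.csv exports the yielded items, which makes it easy to see whether parse_next ever produces anything.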

Answer


OK, so a few things went wrong:

  • Indentation
  • The start_urls list is not closed with ]
  • allowed_domains uses the wrong domain extension: it has .in while you want to scrape .com

The code below works:

import scrapy 
import logging 

class QuotesSpider(scrapy.Spider): 
    name = "nestedurl" 
    allowed_domains = ['www.grohe.com'] 
    start_urls = [ 
     'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/' 
    ] 
    def parse(self, response): 
     # logger.info("Parse function called on %s", response.url) 
     for divs in response.css('div.viewport div.workspace div.float-box'): 
      item = {'producturl': divs.css('a::attr(href)').extract_first(), 
        'imageurl': divs.css('a img::attr(src)').extract_first(), 
        'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()} 
      next_page = response.urljoin(item['producturl']) 
      #logger.info("This is an information %s", next_page) 
      yield scrapy.Request(next_page, callback=self.parse_next, meta={'item': item}) 
      #yield item 

    def parse_next(self, response): 
     item = response.meta['item'] 
     # logger.info("Parse function called on2 %s", response.url) 
     item['headline'] = response.css('div#content a.headline::text').extract() 
     return item 
     #response.css('div#product-variants a::attr(href)').extract() 

Note: I removed some of the logging / item pipeline code, since those were not defined on my machine.
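
For completeness, a minimal sketch of the kind of item pipeline that was stripped out; the class name CsvExportPipeline and the module path ScrapyTestProject.pipelines are assumptions, not from the original project:

 # pipelines.py -- hypothetical pass-through pipeline
 # enable it in settings.py with:
 #   ITEM_PIPELINES = {'ScrapyTestProject.pipelines.CsvExportPipeline': 300}
 class CsvExportPipeline:
     def process_item(self, item, spider):
         # every item returned by parse_next() passes through here;
         # for plain CSV output the built-in feed export
         # (scrapy crawl nestedurl -o items.csv) is usually enough
         return item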


Thanks. If I comment out **allowed_domains** it works. I have a long list of URLs, and each URL's page presents its HTML in its own way. How can I write a generalized script that walks the HTML of those listed page URLs and scrapes the data regardless of the format? – pradeep
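
Not part of the original thread, but one common way to approach that: give the spider a list of candidate selectors per field and take the first one that matches, so a single parse method can tolerate differing page layouts. The spider name, URL list, and selector lists below are illustrative placeholders:

 import scrapy

 class MultiLayoutSpider(scrapy.Spider):
     name = "multilayout"                     # hypothetical spider name
     start_urls = [
         # one entry per listing page to crawl (placeholder)
         'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
     ]

     # candidate CSS selectors per field, tried in order (placeholders)
     FIELD_SELECTORS = {
         'headline': ['div#content a.headline::text', 'h1.title::text'],
         'description': ['a div.text::text', 'div.desc p::text'],
     }

     def parse(self, response):
         item = {}
         for field, selectors in self.FIELD_SELECTORS.items():
             for sel in selectors:
                 value = response.css(sel).extract_first()
                 if value:                    # keep the first selector that hits
                     item[field] = value.strip()
                     break
         yield item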