2017-02-15

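Scrapy: scraping data from nested URLs

I have a website, https://www.grohe.com/in. On that site I want to scrape one type of bathroom faucet, https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/. That page lists multiple products/related products. I want to get each product's URL and scrape its data. For that I wrote the following ...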

My items.py file looks like this:

from scrapy.item import Item, Field 

class ScrapytestprojectItem(Item): 
    producturl=Field() 
    imageurl=Field() 
    description=Field() 

The spider code is:

import scrapy
from ScrapyTestProject.items import ScrapytestprojectItem


class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract(),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])
            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})

When I run **scrapy crawl nestedurl -o nestedurl.csv**, an empty file is created. The console output is:

2017-02-15 18:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-02-15 18:03:13 [scrapy] DEBUG: Crawled (200) <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
2017-02-15 18:03:13 [scrapy] ERROR: Spider error processing <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/pradeep/ScrapyTestProject/ScrapyTestProject/spiders/nestedurl.py", line 15, in parse
    next_page = response.urljoin(item['producturl'])
  File "/usr/lib/python2.7/dist-packages/scrapy/http/response/text.py", line 72, in urljoin
    return urljoin(get_base_url(self), url)
  File "/usr/lib/python2.7/urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
TypeError: unhashable type: 'list'
2017-02-15 18:03:13 [scrapy] INFO: Closing spider (finished)
2017-02-15 18:03:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 253,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 31063,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 15, 12, 33, 13, 396542),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2017, 2, 15, 12, 33, 11, 568424)}
2017-02-15 18:03:13 [scrapy] INFO: Spider closed (finished)

Answers

0

I think divs.css('a::attr(href)').extract() returns a list, and that makes urlparse crash when the value is passed through urljoin, because a list cannot be hashed.
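To see the difference outside Scrapy, here is a minimal sketch using only the standard library. It assumes Python 3 (the traceback in the question is from Python 2.7, so the wording of the error may differ), and the href value is a made-up example, not one taken from the site.

```python
from urllib.parse import urljoin

base = "https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/"
# Hypothetical result of .extract(): always a list, even with one match.
hrefs = ["/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145"]

try:
    urljoin(base, hrefs)  # passing the whole list, as the spider does
    list_accepted = True
except TypeError:
    list_accepted = False

# Passing a single string works and gives an absolute URL.
joined = urljoin(base, hrefs[0])

print(list_accepted)
print(joined)
```

So the value handed to urljoin needs to be one string, not the list that .extract() returns.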

0

The URL is not being generated correctly.

You should enable logging and log some messages to debug your code.

import logging

import scrapy
from ScrapyTestProject.items import ScrapytestprojectItem


class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract(),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])

            logging.info(next_page)  # see what it prints in console.

            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})
+0

The generated URL looks like '/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000'. It should be appended to the 'www.grohe.in' URL, then it makes sense – mvnpgh

+0

Logger info: [https://www.grohe.com/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=33623-G145&color=000&material=33623000] ... multiple URLs are formed the same way – mvnpgh

+0

No, you can join the URL manually, like 'www.grohe.in' + item['producturl'] – Umair
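For comparison, a small sketch of what the manual concatenation from the comment above produces versus resolving the href against the page URL. It uses only Python 3's standard library; the href is the one quoted in the comments. Note that the concatenated string has no scheme, and Scrapy's Request rejects URLs without one.

```python
from urllib.parse import urljoin, urlsplit

page_url = "https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/"
href = ("/in/8257/bathroom/bathroom-faucets/essence/product-details/"
        "?product=19408-G145&color=000&material=19408000")

# Manual concatenation as suggested in the comment: no "https://" prefix.
manual = "www.grohe.in" + href

# Resolving against the page URL keeps the scheme and host of the page.
resolved = urljoin(page_url, href)

print(repr(urlsplit(manual).scheme))   # scheme of the concatenated string
print(urlsplit(resolved).scheme, urlsplit(resolved).netloc)
print(resolved)
```

Inside a spider, response.urljoin(href) does the same resolution against the response URL, as long as href is a single string.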

0
item = {'producturl': divs.css('a::attr(href)').extract(),  # <--- issue here
        'imageurl': divs.css('a img::attr(src)').extract(),
        'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
next_page = response.urljoin(item['producturl'])  # <--- here item['producturl'] is a list

To fix this, use .extract_first(''), which returns a single string (or '' when nothing matches) instead of a list:

item = {'producturl': divs.css('a::attr(href)').extract_first(''),
        'imageurl': divs.css('a img::attr(src)').extract_first(''),
        'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
next_page = response.urljoin(item['producturl'])
+0

In my spider code I used .extract_first() / .extract_first(''). Still the same output, no change. I tested the same thing in scrapy shell with .extract() itself, and it seems fine – mvnpgh

+0

producturl looks like ----> /in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000, and after that we form the link as 'https://www.grohe.com/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000' – mvnpgh