
I can't retrieve images with Scrapy's images pipeline. Judging from the error output, I believe I'm passing Scrapy the correct image_urls. However, instead of downloading images from them, Scrapy returns the error: ValueError: Missing scheme in request url: h.

This is my first time using the images pipeline feature, so I suspect I'm making a simple mistake. All the same, I'd appreciate help resolving it.

Below you'll find my spider, settings, items, and the error output. They're not an MWE, but I think they're fairly simple and easy to follow.

Spider:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from ngamedallions.items import NgamedallionsItem 
from scrapy.loader import ItemLoader 
from scrapy.loader.processors import TakeFirst, Join 
from scrapy.http import Request 
import re 

class NGASpider(CrawlSpider): 
    name = 'ngamedallions' 
    allowed_domains = ['nga.gov'] 
    start_urls = [ 
        'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html' 
    ] 

    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')), 
             callback='parse_CatalogRecord', 
             follow=True), 
    ) 

    def parse_CatalogRecord(self, response): 
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response) 
        CatalogRecord.default_output_processor = TakeFirst() 
        keywords = "medal|medallion" 
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE | re.MULTILINE | re.UNICODE) 
        if r.search(response.body_as_unicode()): 
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()') 
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()') 
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()') 
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src') 

            return CatalogRecord.load_item() 

Settings:

BOT_NAME = 'ngamedallions' 

SPIDER_MODULES = ['ngamedallions.spiders'] 
NEWSPIDER_MODULE = 'ngamedallions.spiders' 

DOWNLOAD_DELAY=3 

ITEM_PIPELINES = { 
    'scrapy.pipelines.images.ImagesPipeline': 1, 
} 

IMAGES_STORE = '/home/tricia/Documents/Programing/Scrapy/ngamedallions/medallionimages' 

Items:

import scrapy 

class NgamedallionsItem(scrapy.Item): 
    title = scrapy.Field() 
    accession = scrapy.Field() 
    inscription = scrapy.Field() 
    image_urls = scrapy.Field() 
    images = scrapy.Field() 
    pass 

Error log:

2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions) 
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11 
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3} 
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline 
2016-04-24 19:00:40 [scrapy] INFO: Spider opened 
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None) 
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: None) 
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html) 
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a', 
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg', 
'inscription': u'around circumference: IOHANNES FRANCISCVS GON MA; around bottom circumference: MANTVA', 
'title': u'Gianfrancesco Gonzaga di Rodigo, 1445-1496, Lord of Bozzolo, Sabbioneta, and Viadana 1478 [obverse]'} 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item 
requests = arg_to_iter(self.get_media_requests(item, info)) 
    File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests 
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])] 
    File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__ 
self._set_url(url) 
    File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url 
self._set_url(url.encode(self.encoding)) 
    File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url 
raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: h 
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html) 
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b', 
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg', 
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA', 
'title': u'House between Two Hills [reverse]'} 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks 
current.result = callback(current.result, *args, **kw) 
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item 
requests = arg_to_iter(self.get_media_requests(item, info)) 
    File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests 
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])] 
    File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__ 
self._set_url(url) 
    File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url 
self._set_url(url.encode(self.encoding)) 
    File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url 
    raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: h 
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html) 
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished) 
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1609, 
'downloader/request_count': 5, 
'downloader/request_method_count/GET': 5, 
'downloader/response_bytes': 125593, 
'downloader/response_count': 5, 
'downloader/response_status_count/200': 5, 
'dupefilter/filtered': 5, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181), 
'log_count/DEBUG': 7, 
'log_count/ERROR': 2, 
'log_count/INFO': 7, 
'request_depth_max': 2, 
'response_received_count': 5, 
'scheduler/dequeued': 5, 
'scheduler/dequeued/memory': 5, 
'scheduler/enqueued': 5, 
'scheduler/enqueued/memory': 5, 
'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)} 
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished) 

Answer:

The TakeFirst processor is turning image_urls into a single string, when it should be a list of URLs.
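A quick way to see where the stray `h` comes from: the images pipeline builds one `Request` per element of `image_urls`, and iterating over a plain string yields its individual characters. A minimal sketch (not Scrapy's actual code):

```python
# image_urls as a bare string, which is what TakeFirst produces:
url = 'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg'

# The pipeline does roughly: [Request(x) for x in item.get('image_urls', [])]
# Iterating the string hands each character to Request in turn:
chars = [x for x in url][:4]
print(chars)  # ['h', 't', 't', 'p'] -- the first "URL" seen is just 'h'
```

So the "Missing scheme in request url: h" message is Scrapy rejecting the single character `h` as a URL.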

Add:

CatalogRecord.image_urls_out = lambda v: v 

Edit:

This could also be:

CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
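For illustration, here is a minimal, Scrapy-free sketch of what the two output processors do (the real classes live in `scrapy.loader.processors`; these stand-ins only mirror the behavior relevant here):

```python
def take_first(values):
    # Mimics TakeFirst: return only the first non-empty extracted value.
    for v in values:
        if v is not None and v != '':
            return v

def identity(values):
    # Mimics Identity (or lambda v: v): return the extracted values unchanged.
    return values

extracted = ['http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg']
print(take_first(extracted))  # bare string -- the pipeline then iterates its characters
print(identity(extracted))    # list of URLs -- one Request per URL, as intended
```

This is why overriding `image_urls_out` fixes the error: the field then reaches the pipeline as a list, while `default_output_processor = TakeFirst()` keeps collapsing the single-valued fields like `title` to plain strings.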