2016-08-04 91 views
0

Scrapy image download

My spider runs without showing any errors, but no images are saved to the folder. Here are my Scrapy files:

Spider.py:

import scrapy 
import re 
import os 
import urlparse 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.loader.processors import Join, MapCompose, TakeFirst 
from scrapy.pipelines.images import ImagesPipeline 
from production.items import ProductionItem, ListResidentialItem 

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(item, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''

        return item

settings.py:

from scrapy.settings.default_settings import ITEM_PIPELINES 
from scrapy.pipelines.images import ImagesPipeline 

BOT_NAME = 'production' 

SPIDER_MODULES = ['production.spiders'] 
NEWSPIDER_MODULE = 'production.spiders' 
DEFAULT_ITEM_CLASS = 'production.items' 

ROBOTSTXT_OBEY = True 
DEPTH_PRIORITY = 1 
IMAGE_STORE = '/images' 

CONCURRENT_REQUESTS = 250 

DOWNLOAD_DELAY = 2 

ITEM_PIPELINES = { 
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300, 
} 

items.py

# -*- coding: utf-8 -*- 
import scrapy 

class ProductionItem(scrapy.Item): 
    img_url = scrapy.Field() 

# ScrapingList Residential & Yield Estate for sale 
class ListResidentialItem(scrapy.Item): 
    image_urls = scrapy.Field() 
    images = scrapy.Field() 

    pass 

My pipeline file is empty; I'm not sure what I am supposed to add to the pipeline.py file.

Any help is greatly appreciated.

Answers

5

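Since you don't know what to put in the pipelines, I assume you can use the default images pipeline that Scrapy provides, so in the settings.py file you can just declare it like this: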

ITEM_PIPELINES = { 
'scrapy.pipelines.images.ImagesPipeline':1 
} 

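Also, your images path is wrong: the leading / means you are pointing at the absolute root of your filesystem. Either set the absolute path of the folder you want to save to, or use a path relative to where you run the crawler: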

# either an absolute path:
IMAGES_STORE = '/home/user/Documents/scrapy_project/images'

# or a path relative to where you run the crawler:
IMAGES_STORE = 'images'
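
The settings discussed so far can be checked mechanically. The sketch below is a hypothetical helper (`check_image_settings` is not part of Scrapy) that takes a plain dict of settings and flags the two things that appear to be wrong in the question: the store setting is spelled `IMAGE_STORE`, whereas the images pipeline reads `IMAGES_STORE`, and the pipeline is registered under the older `scrapy.contrib` path, which was deprecated in Scrapy 1.0. It also shows how a relative store path resolves against the directory the crawl is started from.

```python
import os

# Import path of the stock images pipeline in Scrapy 1.0 and later.
IMAGES_PIPELINE = "scrapy.pipelines.images.ImagesPipeline"


def check_image_settings(settings, cwd):
    """Return (problems, store_path) for a dict of Scrapy-style settings.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    problems = []

    pipelines = settings.get("ITEM_PIPELINES", {})
    if IMAGES_PIPELINE not in pipelines:
        problems.append("ITEM_PIPELINES does not enable " + IMAGES_PIPELINE)

    store = settings.get("IMAGES_STORE")
    if store is None:
        problems.append("IMAGES_STORE is not set")
        if "IMAGE_STORE" in settings:
            problems.append("IMAGE_STORE is misspelled; use IMAGES_STORE")
        return problems, None

    # A relative path is resolved against the directory the crawl runs from.
    store_path = os.path.normpath(os.path.join(cwd, store))
    return problems, store_path
```

Called with the question's settings, this would report all three problems; with `IMAGES_STORE = 'images'` and the stock pipeline enabled, it reports none and returns the resolved folder. The working directory passed as `cwd` is whatever directory you launch the crawl from.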

Now, in the spider you extract the URL but you never save it to the item:

# image_urls must be a list of URLs
item['image_urls'] = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()

The field has to be named literally image_urls if you are using the default pipeline.
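
As a sketch of what the images pipeline expects in that field: `image_urls` should hold a list of absolute URLs, not a single string and not relative links. The helper below is hypothetical (`build_image_urls` and the example URLs are made up) and uses Python 3's `urllib.parse`; the question's Python 2 code would import `urlparse` instead.

```python
from urllib.parse import urljoin


def build_image_urls(page_url, hrefs):
    """Turn extracted href/src values into a list of absolute URLs."""
    urls = []
    for href in hrefs:
        href = href.strip()
        if not href:
            continue  # skip empty attribute values
        urls.append(urljoin(page_url, href))
    return urls


# Made-up page URL and extracted attribute values.
page = "http://example.com/listing/42/"
extracted = [" /photos/a.jpg ", "b.jpg", "", "http://cdn.example.com/c.jpg"]
image_urls = build_image_urls(page, extracted)
```

In a spider callback this would be used along the lines of `item['image_urls'] = build_image_urls(response.url, response.xpath(...).extract())`.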

Now, in the items.py file, you need to add the following 2 fields (both need to have exactly these names):

image_urls = scrapy.Field()
images = scrapy.Field()

That should work.

Thanks Rafael, but there are still no images populating the images folder. I added the pipeline to the settings.py file, changed the store path, and changed the following lines: image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()) item['image_urls'] = [x for x in image_urls] to item['image_urls'] = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()) – user1443063

You can't map the images. If you want to save multiple images in one item, you have to build an array rather than a map; a map will not work –

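I'm very new to all this. I tried to fix it by changing it to this: item['image_urls'] = response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()[0] The [0] should give just one image, but it still isn't showing up. Am I still missing something, or does it still need to be an array? – user1443063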
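
The difference the comments are circling around can be seen without Scrapy: code that loops over `image_urls` gets one URL per iteration from a list, but one character per iteration from a plain string, which is what `extract()[0]` returns. A minimal illustration, with a made-up URL and a hypothetical `as_url_list` helper:

```python
def as_url_list(value):
    """Wrap a single URL string in a list; pass other iterables through."""
    if isinstance(value, str):
        return [value]
    return list(value)


single = "http://example.com/a.jpg"  # made-up URL

# Looping over a bare string yields characters, which are not valid URLs.
chars = [u for u in single]

# Looping over a one-element list yields the URL itself.
urls = [u for u in as_url_list(single)]
```

So even when only one image is wanted, the field would still need to be a one-element list such as `[url]`.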

4

My final working result:

spider.py

import scrapy 
import re 
import urlparse 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.loader.processors import Join, MapCompose, TakeFirst 
from scrapy.pipelines.images import ImagesPipeline 
from production.items import ProductionItem 
from production.items import ImageItem 

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["url"]
    start_urls = [
        "startingurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@idd="followclaslink"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseImages, meta={'item': item})

    def parseImages(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url])

Settings.py

BOT_NAME = 'production' 

SPIDER_MODULES = ['production.spiders'] 
NEWSPIDER_MODULE = 'production.spiders' 
DEFAULT_ITEM_CLASS = 'production.items' 
ROBOTSTXT_OBEY = True 
IMAGES_STORE = '/Users/home/images' 

DOWNLOAD_DELAY = 2 

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1} 
# Disable cookies (enabled by default) 

items.py

# -*- coding: utf-8 -*- 
import scrapy 

class ProductionItem(scrapy.Item): 
    img_url = scrapy.Field() 
# ScrapingList Residential & Yield Estate for sale 
class ListResidentialItem(scrapy.Item): 
    image_urls = scrapy.Field() 
    images = scrapy.Field() 

class ImageItem(scrapy.Item): 
    image_urls = scrapy.Field() 
    images = scrapy.Field() 

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per URL in the item.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples, one per request.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # The item class must declare an image_paths field for this to work.
        item['image_paths'] = image_paths
        return item
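
For reference, `item_completed` receives `results` as a list of `(success, info)` 2-tuples, where `info` is a dict with keys such as `url`, `path`, and `checksum` on success and a failure object otherwise. The sketch below runs the same list comprehension on hand-made sample data so the filtering can be seen in isolation. The URLs, paths, and checksums are invented, and a plain `Exception` stands in for the failure object.

```python
# Sample data shaped like the `results` argument of item_completed.
results = [
    (True, {"url": "http://example.com/a.jpg",
            "path": "full/a.jpg",
            "checksum": "0123abcd"}),
    (False, Exception("download failed")),  # stand-in for a failure
    (True, {"url": "http://example.com/b.jpg",
            "path": "full/b.jpg",
            "checksum": "4567ef01"}),
]

# Keep the stored path of every image that downloaded successfully.
image_paths = [info["path"] for ok, info in results if ok]
```

A custom pipeline like this only runs if it is listed in `ITEM_PIPELINES`; the settings.py shown in this answer enables the stock `ImagesPipeline` instead.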