Scraping images, empty response [scrapy]

I am working from a Scrapy image-scraping example.

But no files are being saved to my computer.

This is the code I am using:

//Items.py//

import scrapy 

class ImgurItem(scrapy.Item): 
    title = scrapy.Field() 
    image_urls = scrapy.Field() 
    images = scrapy.Field()

// settings.py//

BOT_NAME = 'imgur' 

SPIDER_MODULES = ['imgur.spiders'] 
NEWSPIDER_MODULE = 'imgur.spiders' 
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1} 
IMAGES_STORE = '/home/ubuntu/imgurFront/'

//imgur_spider.py//

import scrapy 

from scrapy.contrib.spiders import Rule, CrawlSpider 
from scrapy.contrib.linkextractors import LinkExtractor 
from imgur.items import ImgurItem 

class ImgurSpider(CrawlSpider): 
    name = 'imgur' 
    allowed_domains = ['imgur.com'] 
    start_urls = ['http://www.imgur.com'] 
    rules = [Rule(LinkExtractor(allow=['/gallery/.*']), 'parse_imgur')] 

    def parse_imgur(self, response): 
     image = ImgurItem() 
     image['title'] = response.xpath(\ 
      "//h2[@id='image-title']/text()").extract() 
     rel = response.xpath("//img/@src").extract() 
     image['image_urls'] = ['http:'+rel[0]] 
     return image

This is the kind of output I get:

{'image_urls': [u'http:data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'], 
'images': [], 
'title': []}

These are the errors I get:

[scrapy] ERROR: File (unknown-error): Error processing file from <GET http://i.imgur.com/BGVbmqM.jpg> referred in <None> 




DEBUG: Retrying <GET http:howard-funk.jpg> (failed 1 times): Connection was refused by other side: 111: Connection refused 


DEBUG: Scraped from <200

Source

2016-01-17 Luis Ramon Ramirez Rodriguez

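Which version of Scrapy are you using? Make sure you have write permission on the folder.

As a last resort, you can create a custom pipeline (http://doc.scrapy.org/en/latest/topics/media-pipeline.html#custom-images-pipeline-example) and catch some of the errors.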
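A rough sketch of what such a custom pipeline could do with failed downloads. It assumes the `results` argument that Scrapy passes to `ImagesPipeline.item_completed` is a list of `(success, value)` 2-tuples, as the linked docs describe. The helper name `split_results` and the sample data are made up for illustration, and a plain exception stands in for the failure object Scrapy would supply.

```python
def split_results(results):
    """Separate stored file paths from failures in media-pipeline results.

    `results` is assumed to be a list of (success, value) 2-tuples:
    a dict with a 'path' key on success, a failure object otherwise.
    """
    paths, errors = [], []
    for ok, value in results:
        if ok:
            paths.append(value["path"])
        else:
            # Keep a readable message so the failure can be logged.
            errors.append(str(value))
    return paths, errors


# Stand-in data (hypothetical): one stored image and one failed download.
example = [
    (True, {"url": "http://example.com/a.jpg", "path": "full/a.jpg"}),
    (False, IOError("Connection refused")),
]
paths, errors = split_results(example)
print(paths)
print(errors)
```

In a subclass of `ImagesPipeline`, `item_completed(self, results, item, info)` could call a helper like this, log `errors`, and raise `DropItem` when `paths` is empty, along the lines of the example in the linked documentation.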

Source

2016-01-18 01:53:34 magexcustomer

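Version 1.0.3. I have already tried running as root, with the same problem. By the way, I also get this: Connection was refused by other side: 111: Connection refused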

It seems the web page blocks bot connections. Try to simulate a browser user agent (search Google for RandomUserAgentMiddleware) and/or use Tor or a proxy with scrapy (HTTP_PROXY in settings.py).

Source

2016-01-18 02:08:07 magexcustomer

How can I make sure? In my settings: USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'. If I use scrapy shell, I can scrape the page's images. Also, using Tor seems like overkill; I tried it once, without success.

Hmm, I just saw your edit. You are trying to download the image from the URL 'http:howard-funk.jpg', so that certainly won't work. Try something like `if 'http://' not in rel: rel = response.url + '/' + rel`. Adapt it to your spider. – magexcustomer

You have two problems here:

Using urljoin is the recommended way to get a fully qualified URL:
```
image['image_urls'] = [response.urljoin(rel[0])] 
```
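You are getting base64-encoded image data. You should skip values that start with the data:image prefix, or handle them differently (since the value already is the content of the image file, there is nothing to download).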
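Both points can be combined in one small filter. This is a minimal sketch using only the standard library, not the asker's actual spider: the function name `absolute_image_urls` and the gallery URL are hypothetical. Inside a spider, `response.urljoin(src)` would take the place of `urljoin(page_url, src)`.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Return absolute URLs for downloadable images, skipping inline data: URIs."""
    urls = []
    for src in srcs:
        src = src.strip()
        # A data: URI already contains the image bytes; there is nothing to fetch.
        if not src or src.startswith("data:"):
            continue
        # urljoin handles scheme-relative (//host/path) and relative paths alike.
        urls.append(urljoin(page_url, src))
    return urls


# Hypothetical page URL and src values resembling the ones in the question.
srcs = [
    "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
    "//i.imgur.com/BGVbmqM.jpg",
    "howard-funk.jpg",
]
image_urls = absolute_image_urls("http://imgur.com/gallery/abc", srcs)
print(image_urls)
```

Note that this keeps every usable src on the page rather than only `rel[0]`; taking just the first match is what let the 1x1 placeholder GIF end up in image_urls.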

Source

2016-01-18 13:45:42 Rolando
