How to set a depth limit in Scrapy

I'm using this spider to crawl pages and download images:

import re
from urllib.parse import urljoin

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from imgur.items import ImgurItem


class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['some.page']

    start_urls = ['some.page']

    # Every link extracted from a crawled page is sent to parse_imgur.
    rules = [Rule(LinkExtractor(allow=['.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = 'a'

        # Collect quoted attribute values ending in .jpg and make them absolute.
        relative_urls = re.findall(r'=\s*"([^"]+\.jpg)', response.text)
        image['image_urls'] = [urljoin(response.url, url) for url in relative_urls]

        return image

But I have two problems. The first is that I can't limit the crawl depth to one, although I pass "-s DEPTH_LIMIT=1" when I run the spider:

scrapy crawl imgur -s DEPTH_LIMIT=1

The second problem is that I get images from every page except the home page:

I don't get any images from that page.

Edit.

@Javitronxo

Like this:

def parse(self, response):
    image = ImgurItem()
    image['title'] = 'a'

    # Collect quoted attribute values ending in .jpg and make them absolute.
    relative_urls = re.findall(r'=\s*"([^"]+\.jpg)', response.text)
    image['image_urls'] = [urljoin(response.url, url) for url in relative_urls]

    return image

I don't get any images this way.


2016-02-01 Luis Ramon Ramirez Rodriguez

Because of this rule in your code:

rules = [Rule(LinkExtractor(allow=['.*']), 'parse_imgur')]

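the spider extracts every link on the page, so all of them will end up being followed.

If you only want to scrape the images on the home page, I would suggest removing the rule and changing the method header so that it overrides the default parse: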

def parse(self, response):

This way the spider will scrape the images from the URL in the start_urls field, return the item, and finish execution.


2016-02-01 14:23:07 Javitronxo

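What should I put in the parse method?

Just change the method header. The response argument will be the response for the start URL, so you can create the item directly, look for the images in the body, and return it. – Javitronxo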
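The "look for the images in the body" step can be checked on its own, without running a crawl. The sketch below isolates it using only the standard library; the function name extract_image_urls, the sample HTML, and the example URL are all made up for illustration, and the regex is a lightly tightened version of the one in the question.

```python
import re
from urllib.parse import urljoin

# Same idea as the question's regex: quoted attribute values ending in .jpg.
JPG_ATTR = re.compile(r'=\s*"([^"]+\.jpg)')


def extract_image_urls(page_url, html):
    """Return absolute URLs for every quoted .jpg reference found in html."""
    return [urljoin(page_url, rel) for rel in JPG_ATTR.findall(html)]


if __name__ == "__main__":
    # Hypothetical page content, only for trying the function out.
    sample = '<img src="img/cat.jpg"> <a href="/full/dog.jpg">dog</a>'
    for url in extract_image_urls("http://some.page/gallery/", sample):
        print(url)
```

If this returns an empty list for the saved HTML of the home page, the images are probably not referenced as quoted .jpg paths in the markup, and the regex is the thing to adjust rather than the spider.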
