
Is there any way I can control the crawler so that it will not crawl anything outside the original domains given in the start_urls list? I tried the settings below, but they did not work for me :( (Related: scrapy: prevent the crawler from crawling links within the facebook site?)

import os
import csv

from scrapy import log
from scrapy.contrib.exporter import CsvItemExporter
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.settings import Settings, default_settings
from selenium import webdriver
from urlparse import urlparse

# Limit how many links deep the crawl may go from the start URLs.
default_settings.DEPTH_LIMIT = 3

# Register a custom downloader middleware and disable the built-in
# redirect middleware.
DOWNLOADER_MIDDLEWARES = {
    'grimes2.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}

Can someone help me? Thanks.

Answer


allowed_domains is an optional list of strings containing the domains this spider is allowed to crawl. If the OffsiteMiddleware is enabled (it is by default), requests for URLs not belonging to the domain names in this list will not be followed.

See how it is used in the scrapy tutorial:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Save the page body to a file named after the last path segment.
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
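
Applied to the question's setup, the same mechanism carries over to a CrawlSpider: set allowed_domains and let the OffsiteMiddleware filter the links the rules extract. Below is a minimal sketch, assuming the Python 2 / scrapy contrib APIs imported in the question; SameDomainSpider, the start URL, and parse_item are placeholders, and allowed_domains is derived from start_urls with urlparse:

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SameDomainSpider(CrawlSpider):
    name = "samedomain"  # placeholder spider name
    start_urls = [
        "http://www.example.com/",  # placeholder start URL
    ]
    # Derive allowed_domains from start_urls; with the OffsiteMiddleware
    # enabled, requests to hosts outside this list are silently dropped.
    allowed_domains = [urlparse(url).netloc for url in start_urls]

    # Follow every extracted link; offsite links are filtered out before
    # they are downloaded, so the crawl stays on the starting domains.
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("crawled %s" % response.url)

Note that urlparse(...).netloc keeps the www. prefix, so the list above admits www.example.com and its subdomains but not bare example.com; strip the prefix if the whole domain should be allowed.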