
I have written a Scrapy crawler, but I need to add the ability to read some arguments from the command line and then populate some static fields in my spider class. I also need to override the initializer so that it fills in some spider fields. In short: how do I pass arguments to my crawler in Scrapy?

scrapy runspider crawler.py arg1 arg2 

How can I achieve this? Here is the crawler:

import scrapy 
from scrapy.spiders import Spider 
from scrapy.http import Request 
import re 


class TutsplusItem(scrapy.Item): 
    title = scrapy.Field() 


class MySpider(Spider): 
    name = "tutsplus" 
    allowed_domains = ["bbc.com"] 
    start_urls = ["http://www.bbc.com/"] 

    def parse(self, response): 
        links = response.xpath('//a/@href').extract()
        # We store already-crawled links in this list
        crawledLinks = []

        for link in links:
            # If it is a proper link and has not been checked yet, yield it to the Spider
            # if linkPattern.match(link) and link not in crawledLinks:
            if link not in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item

And how should it then be run?

Answer


You can do this by overriding your spider's __init__ method.

class MySpider(Spider): 
    name = "tutsplus" 
    allowed_domains = ["bbc.com"] 
    start_urls = ["http://www.bbc.com/"] 
    arg1 = None 
    arg2 = None 

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Values passed with -a arrive as strings.
        self.arg1 = arg1
        self.arg2 = arg2

    def parse(self, response): 
        links = response.xpath('//a/@href').extract()
        # We store already-crawled links in this list
        crawledLinks = []

        for link in links:
            # If it is a proper link and has not been checked yet, yield it to the Spider
            # if linkPattern.match(link) and link not in crawledLinks:
            if link not in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item

Then run your spider like:

scrapy crawl tutsplus -a arg1=arg1 -a arg2=arg2 
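As a side note, Scrapy's base Spider.__init__ already copies any -a key=value pairs onto the spider instance as attributes, so strictly speaking you can read the values without overriding __init__ at all; the explicit override is mainly useful for validating or converting the raw string values. A minimal sketch of the no-override variant (the getattr defaults are just a guard for a missing -a option, not part of the original code):

import scrapy
from scrapy.spiders import Spider


class MySpider(Spider):
    name = "tutsplus"
    start_urls = ["http://www.bbc.com/"]

    def parse(self, response):
        # Spider.__init__ copied the -a pairs onto the instance, so the
        # values are available directly; getattr guards against omitted -a.
        arg1 = getattr(self, "arg1", None)
        arg2 = getattr(self, "arg2", None)
        self.logger.info("arg1=%s, arg2=%s", arg1, arg2)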

Thanks for the reply, but what if I want to run it like "scrapy runspider crawler.py arg1 arg2", e.g. using getopt.getopt(args, options, [long_options])? – Luckylukee


You would have to dig into the class that implements the runspider command. –
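For what it's worth, runspider accepts the same -a options (scrapy runspider crawler.py -a arg1=foo -a arg2=bar), so getopt is not needed for that. If you really do want to parse sys.argv yourself, one alternative is to skip the scrapy command line entirely and drive the spider from a plain script via Scrapy's CrawlerProcess. A minimal sketch, assuming the MySpider class from the answer lives in crawler.py (the script name run.py is illustrative):

# run.py -- drive the spider from plain Python so sys.argv is yours to parse
import sys

from scrapy.crawler import CrawlerProcess

from crawler import MySpider  # the spider class defined above

if __name__ == "__main__":
    arg1, arg2 = sys.argv[1], sys.argv[2]  # e.g. python run.py foo bar
    process = CrawlerProcess()
    # Keyword arguments to crawl() are forwarded to the spider's __init__.
    process.crawl(MySpider, arg1=arg1, arg2=arg2)
    process.start()  # blocks until the crawl finishes

Run it with: python run.py foo bar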