
I can run a crawl in a Python script with the following recipe from the wiki (Pass arguments to a scrapy spider within a Python script):

from twisted.internet import reactor 
from scrapy.crawler import Crawler 
from scrapy import log, signals 
from testspiders.spiders.followall import FollowAllSpider 
from scrapy.utils.project import get_project_settings 

spider = FollowAllSpider(domain='scrapinghub.com') 
settings = get_project_settings() 
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) 
crawler.configure() 
crawler.crawl(spider) 
crawler.start() 
log.start() 
reactor.run() 

As you can see, I can pass the domain to FollowAllSpider, but my question is: how can I pass start_urls (actually a date that will be appended to a fixed URL) to my spider class using the code above?

This is my spider class:

class MySpider(CrawlSpider): 
    name = 'tw' 

    def __init__(self, date): 
        y, m, d = date.split('-')  # this is a test, it could split with a regex! 
        try: 
            y, m, d = int(y), int(m), int(d) 
        except ValueError: 
            raise ValueError('Enter a valid date') 

        self.allowed_domains = ['mydomin.com'] 
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y, m, d)] 

    def parse(self, response): 
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href') 
        for question in questions: 
            item = PoptopItem() 
            item['url'] = question.extract() 
            yield item['url'] 

And this is my script:

from pdfcreator import convertor 
from twisted.internet import reactor 
from scrapy.crawler import Crawler 
from scrapy import log, signals 
#from testspiders.spiders.followall import FollowAllSpider 
from scrapy.utils.project import get_project_settings 
from poptop.spiders.stackoverflow_spider import MySpider 
from poptop.items import PoptopItem 

settings = get_project_settings() 
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) 
crawler.configure() 

date = raw_input('Enter the date with this format (d-m-Y) : ') 
print date 
spider = MySpider(date=date) 
crawler.crawl(spider) 
crawler.start() 
log.start() 
item = PoptopItem() 

for url in item['url']: 
    convertor(url) 

reactor.run() # the script will block here until the spider_closed signal was sent 

And if I just print item, I get the following error:

2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17> 

My items:

import scrapy 


class PoptopItem(scrapy.Item): 
    titles = scrapy.Field() 
    content = scrapy.Field() 
    url = scrapy.Field() 

Answer


You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:

from datetime import datetime 

class MySpider(CrawlSpider): 
    name = 'tw' 
    allowed_domains = ['test.com'] 

    def __init__(self, *args, **kwargs): 
        super(MySpider, self).__init__(*args, **kwargs) 

        date = kwargs.get('date') 
        if not date: 
            raise ValueError('No date given') 

        dt = datetime.strptime(date, "%m-%d-%Y") 
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)] 

Then, you would instantiate the spider this way:

spider = MySpider(date='01-01-2015') 
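As a quick sanity check (plain Python, no Scrapy required), the strptime call and the URL formatting behave like this; the date string and the test.com URL pattern are the ones from the example above:

```python
from datetime import datetime

# Parse a date string in the "%m-%d-%Y" format used above
dt = datetime.strptime('01-01-2015', "%m-%d-%Y")

# Build the start URL the same way the spider does
url = 'http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)
print(url)  # -> http://test.com/2015-1-1
```

Note that {dt.month} and {dt.day} render without zero padding, which is why the URL ends in 2015-1-1 rather than 2015-01-01.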

Or, you can avoid parsing the date altogether by passing a datetime instance in the first place:

class MySpider(CrawlSpider): 
    name = 'tw' 
    allowed_domains = ['test.com'] 

    def __init__(self, *args, **kwargs): 
        super(MySpider, self).__init__(*args, **kwargs) 

        dt = kwargs.get('dt') 
        if not dt: 
            raise ValueError('No date given') 

        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)] 

spider = MySpider(dt=datetime(year=2014, month=1, day=1)) 

Also, just for reference, see this answer for a detailed example of how to run Scrapy from a script.
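As for reading the scraped urls back in the script: iterating over a freshly constructed PoptopItem() raises KeyError, because that new item is empty. One way to get at the results is to collect items as they are scraped, via Scrapy's item_scraped signal. Below is a minimal sketch of the collection pattern; the Scrapy-specific registration line is shown as a comment, the handler signature matches what the item_scraped signal sends, and the two simulated items are made-up examples standing in for what the spider would yield:

```python
collected = []

def item_scraped(item, response=None, spider=None):
    # Signature matches handlers for Scrapy's item_scraped signal
    collected.append(item)

# In the real script, register the handler before starting the crawl:
# crawler.signals.connect(item_scraped, signal=signals.item_scraped)

# Simulated items, standing in for what the spider would yield:
item_scraped({'url': 'http://example.com/q1'})
item_scraped({'url': 'http://example.com/q2'})

# After reactor.run() returns, the urls are available:
for item in collected:
    print(item['url'])
```

With this in place, the `for url in item['url']` loop in the script would become a loop over `collected` after the reactor has stopped.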


Thanks a lot for your explanation! As I said, the date parser was just a test. Also thanks for the link suggestion. Now, as you can see, my 'parse' function yields 'url'; how can I get it (after crawling)? – Kasramvd 2015-02-25 12:53:01


I used the item, but it raises a 'KeyError'; it seems the crawl doesn't run! 'for url in item['url']:' – Kasramvd 2015-02-25 13:01:14


@KasraAD I think you just need to 'yield item' instead of 'yield item['url']'. Let me know if it helps. – alecxe 2015-02-25 13:15:28