I can run a crawl from a Python script using the following recipe from the wiki: Passing arguments to a scrapy spider inside a Python script
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
As you can see, I can only pass domain to FollowAllSpider. My question is: how can I pass start_urls (actually a date that will be appended to a fixed URL) to my spider class using the code above?
This is my spider class:
class MySpider(CrawlSpider):
    name = 'tw'

    def __init__(self, date):
        y, m, d = date.split('-')  # this is a test, it could split with regex!
        try:
            y, m, d = int(y), int(m), int(d)
        except ValueError:
            raise ValueError('Enter a valid date')
        self.allowed_domains = ['mydomin.com']
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y, m, d)]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href')
        for question in questions:
            item = PoptopItem()
            item['url'] = question.extract()
            yield item['url']
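As an aside, the hand-rolled date validation in __init__ above can be replaced with datetime.strptime, which both validates and parses in one step. A minimal sketch, assuming the Y-m-d order that __init__ unpacks (note the script's prompt says d-m-Y, which disagrees with that unpacking); the base URL 'my_start_urls' is kept as the placeholder from the question:

```python
from datetime import datetime

def build_start_url(date_str, base='my_start_urls'):
    # strptime raises ValueError on malformed input, so no manual
    # split/int/try-except chain is needed.
    d = datetime.strptime(date_str, '%Y-%m-%d')
    return '{}{}-{}-{}'.format(base, d.year, d.month, d.day)
```

The spider's __init__ could then just do `self.start_urls = [build_start_url(date)]`.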
And this is my script:
from pdfcreator import convertor
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
#from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
from poptop.spiders.stackoverflow_spider import MySpider
from poptop.items import PoptopItem

settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()

date = raw_input('Enter the date with this format (d-m-Y) : ')
print date
spider = MySpider(date=date)
crawler.crawl(spider)
crawler.start()
log.start()
item = PoptopItem()
for url in item['url']:
    convertor(url)
reactor.run()  # the script will block here until the spider_closed signal was sent
If I just print the item, I get the following error:
2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17>
Items:
import scrapy

class PoptopItem(scrapy.Item):
    titles = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
Thanks a lot for the explanation! As I said, the date parser is just a test! And thanks for the link suggestion. Now, as you can see, my `parse` function yields `url` — how can I get it (after crawling)? – Kasramvd 2015-02-25 12:53:01

I used the item, but it raises a `KeyError` — it seems the crawl doesn't run! `for url in item['url']:` – Kasramvd 2015-02-25 13:01:14

@KasraAD I think you just need to `yield item` instead of `yield item['url']`. Let me know if it helps. – alecxe 2015-02-25 13:15:28
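Putting alecxe's suggestion together with the follow-up question ("how can I get the urls after crawling?"): with the pre-1.0 Scrapy API used in the script above, you can connect a handler to `signals.item_scraped` so the script collects items while the reactor runs, then process the urls after `reactor.run()` returns. A minimal sketch of that collection pattern — simulated here without a live crawl, with made-up placeholder urls; in the real script you would register it with `crawler.signals.connect(collect_item, signal=signals.item_scraped)`:

```python
collected = []

def collect_item(item, response=None, spider=None):
    # Scrapy invokes item_scraped handlers with (item, response, spider);
    # here we only keep the item.
    collected.append(item)

# Simulate the spider yielding two items during the crawl:
collect_item({'url': 'test-link2015-1-17'})
collect_item({'url': 'test-link2015-1-18'})

# After reactor.run() returns, the urls are available for convertor():
urls = [item['url'] for item in collected]
```

This also sidesteps the `KeyError` from the script's `item = PoptopItem()` line, which created a fresh, empty item instead of reading the ones the spider actually scraped.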