I've created a spider that extends CrawlSpider, following the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html, but the links are not being followed.
The problem is that the spider needs to parse the start URL (which happens to coincide with the hostname) as well as some of the links it contains.
So I defined a rule: rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_items', follow=True)], but nothing happened.
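For context (my note, not part of the original question): SgmlLinkExtractor's allow patterns are regular expressions searched against the absolute URL of each extracted link, so a correctly escaped pattern should match pagination URLs such as http://example.com/page/2/. A minimal sanity check of the regex itself (hostname is a placeholder):

import re

# 'allow' patterns are searched against the absolute URL of each link
pattern = r'/page/\d+'
print re.search(pattern, 'http://example.com/page/2/')  # match object
print re.search(pattern, 'http://example.com/about/')   # None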
Then I tried defining a set of rules, like: rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_items', follow=True), Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]. Now the problem is that the spider parses everything, since '/' is a regex that matches every URL.
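One way to avoid the catch-all behaviour (a sketch of mine, not from the original post; the second rule's XPath is an assumption) is to keep the pagination rule and restrict the second extractor to the parts of the page that actually hold the wanted links, either with a tighter allow regex or with restrict_xpaths:

rules = [
    # follow pagination links such as /page/2
    Rule(SgmlLinkExtractor(allow=[r'/page/\d+']),
         callback='parse_items', follow=True),
    # only extract links found inside the headline elements,
    # instead of every link matching '/'
    Rule(SgmlLinkExtractor(restrict_xpaths=['//h2[@class="headline"]']),
         callback='parse_items', follow=True),
]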
How can I tell the spider to parse the start_url as well as only some of the links it contains?
Update:
I tried overriding the parse_start_url method, so now I'm able to get the data from the start page, but it still doesn't follow the links defined with the Rule:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import Article  # the Article Item class from the project's items.py (module path assumed)

class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']

    rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_links', follow=True)]

    # CrawlSpider calls this for the start_urls responses instead of the rule callbacks
    def parse_start_url(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        articles = []
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles
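If the rule still doesn't fire, it can help to check what its extractor actually matches on the start page. A quick way to do that (a sketch; it assumes the same Scrapy 0.x API used in the question) is the Scrapy shell:

# Start the shell with: scrapy shell http://techcrunch.com
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# List every link the rule's extractor would follow from this response;
# an empty list means the allow pattern matches nothing on the page.
for link in SgmlLinkExtractor(allow=[r'/page/\d+']).extract_links(response):
    print link.url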
Could you post some of your code here as well? – 2012-07-10 09:36:28