Simple Scrapy spider crawls no pages

I'm new to Scrapy and tried to scrape a website with a simple spider (built on top of another spider from this tutorial series: http://scraping.pro/web-scraping-python-scrapy-blog-series/).
Why does my spider crawl 0 pages (with no errors)?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class TutsPlus(CrawlSpider):
    name = "tutsplus"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = [
        "http://code.tutsplus.com/posts?page="
    ]

    rules = [Rule(LinkExtractor(allow=['/posts?page=\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['title'] = response.xpath("//li[@class='posts__post']/a/text()").extract()
        return story
A very similar spider runs fine:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class BbcSpider(CrawlSpider):
    name = "bbcnews"
    allowed_domains = ["bbc.co.uk"]
    start_urls = [
        "http://www.bbc.co.uk/news/technology/",
    ]

    rules = [Rule(LinkExtractor(allow=['/technology-\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['headline'] = response.xpath("//title/text()").extract()
        story['intro'] = response.css('.story-body__introduction::text').extract()
        return story
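For comparison, '/technology-\d+' contains no regex metacharacters that need escaping, so it matches the BBC article URLs literally. A handy way to see exactly which links a rule would actually follow is to run the extractor by hand in scrapy shell (newer Scrapy versions expose it as scrapy.linkextractors; the scrapy.contrib path used above is the older, since-deprecated location):

# Inside: scrapy shell "http://code.tutsplus.com/posts?page=1"
from scrapy.linkextractors import LinkExtractor

# Original pattern: the unescaped '?' means this should return [].
LinkExtractor(allow=[r'/posts?page=\d+']).extract_links(response)

# Escaped pattern: returns the pagination links, if the page links to any.
LinkExtractor(allow=[r'/posts\?page=\d+']).extract_links(response)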
I think your 'allowed_domains' doesn't allow the start page. – furas
@furas, no, that's not it. I changed allowed_domains to: allowed_domains = ["code.tutsplus.com"], still 0 pages. – Macro
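Note that the domain mismatch would still matter alongside the regex problem: the start URL is fetched regardless, but any extracted links to code.tutsplus.com are silently dropped by the offsite filter while allowed_domains only lists net.tutsplus.com. A minimal sketch of the spider with both fixes applied (untested; it assumes the listing is still served at code.tutsplus.com/posts?page=N and that numbering starts at 1):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class TutsPlus(CrawlSpider):
    name = "tutsplus"
    # Extracted /posts?page=N links point at code.tutsplus.com; with only
    # net.tutsplus.com allowed, the offsite middleware drops them silently.
    allowed_domains = ["code.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?page=1"]
    # Raw string, with the '?' escaped so it matches the literal query string.
    rules = [Rule(LinkExtractor(allow=[r'/posts\?page=\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['title'] = response.xpath("//li[@class='posts__post']/a/text()").extract()
        return story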