Simple Scrapy spider cannot crawl any pages

I am new to Scrapy and I am trying to scrape a website with a simple spider (built on the basis of another spider, from this tutorial series: http://scraping.pro/web-scraping-python-scrapy-blog-series/).

Why does my spider crawl 0 pages (with no errors)?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class TutsPlus(CrawlSpider):
    name = "tutsplus"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = [
        "http://code.tutsplus.com/posts?page="
    ]

    rules = [Rule(LinkExtractor(allow=['/posts?page=\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['title'] = response.xpath("//li[@class='posts__post']/a/text()").extract()
        return story

A very similar spider runs fine:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class BbcSpider(CrawlSpider):
    name = "bbcnews"
    allowed_domains = ["bbc.co.uk"]
    start_urls = [
        "http://www.bbc.co.uk/news/technology/",
    ]

    rules = [Rule(LinkExtractor(allow=['/technology-\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['headline'] = response.xpath("//title/text()").extract()
        story['intro'] = response.css('.story-body__introduction::text').extract()
        return story

I think your 'allowed_domains' does not allow the start page. – furas


@furas, no, that is not it. I changed allowed_domains to allowed_domains = ["code.tutsplus.com"] and it is still 0 pages. – Macro

Answer


It looks like your regex '/posts?page=\d+' is not what you actually want: in a regex, the ? makes the preceding s optional, so this pattern matches URLs like '/postspage=2' and '/postpage=2', but not '/posts?page=2'.

I think you want something like '/posts\?page=\d+', which escapes the ? so it is matched literally.
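A minimal sketch of the corrected rule (untested against the live site; the rest of the spider is assumed unchanged):

    rules = [
        # Escape the literal "?" so the pattern matches /posts?page=2,
        # /posts?page=3, and so on. A raw string (r'...') keeps the
        # backslash from being interpreted by Python itself.
        Rule(LinkExtractor(allow=[r'/posts\?page=\d+']), callback='parse_story'),
    ]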


It almost works as expected (unfortunately only "almost"). The spider crawls only 4 pages: http://code.tutsplus.com/posts?page=2, http://code.tutsplus.com/posts?page=3, http://code.tutsplus.com/posts?page=465 and http://code.tutsplus.com/posts?page=466. Any idea why only these pages? – Macro


Because those are the only ones available that match that regex? Check the website. – eLRuLL


All pages from 1 to 466 are available (http://code.tutsplus.com/posts?page=1, http://code.tutsplus.com/posts?page=2, and so on). – Macro
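A plausible explanation, based on CrawlSpider's documented defaults rather than anything confirmed in this thread: when a Rule is given a callback, its follow argument defaults to False, so the spider only queues the pagination links that appear on the start page itself (typically the first few and the last few page numbers) and never discovers the pages in between. Passing follow=True tells CrawlSpider to keep extracting matching links from every page it crawls:

    rules = [
        # follow=True makes CrawlSpider also extract links from the pages
        # this rule matches, so the full pagination can be discovered
        # incrementally. Without it, follow defaults to False whenever a
        # callback is set.
        Rule(LinkExtractor(allow=[r'/posts\?page=\d+']),
             callback='parse_story', follow=True),
    ]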