Absolute vs. relative paths with Scrapy

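I am trying to scrape a forum, ultimately to find posts that contain links. For now I am only trying to scrape the usernames of the posts. But I think there is a problem because the URLs are not static.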

spider.py 

from scrapy.spiders import CrawlSpider 
from scrapy.selector import Selector 
from scrapy.item import Item, Field 


class TextPostItem(Item): 
    title = Field() 
    url = Field() 
    submitted = Field() 


class RedditCrawler(CrawlSpider): 
    name = 'post-spider' 
    allowed_domains = ['flashback.org'] 
    start_urls = ['https://www.flashback.org/t2637903'] 


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="smallfont2"]//@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = Selector(response).xpath('//div[@id="posts"]/div[@class="alignc.p4.post"]')
        for post in posts:
            i = TextPostItem()
            i['title'] = post.xpath('tbody/tr[1]/td/span/text()').extract()[0]
            #i['url'] = post.xpath('div[2]/ul/li[1]/a/@href').extract()[0]
            yield i

This gives me the following error:

raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: /t2637903p2

Any ideas?

Source

2015-10-22 Jomasdf

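The href you extract is relative (/t2637903p2), and Scrapy requests need an absolute URL with a scheme. You need to "join" response.url with the extracted relative URL using urljoin():

from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin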

urljoin(response.url, next_link)
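
To make the effect concrete, here is a minimal standalone sketch using only the standard library. The base URL and the relative link are taken from the question and its error message; the variable names base_url and absolute_url are just illustrative.

```python
from urllib.parse import urljoin  # Python 3; Python 2 used `from urlparse import urljoin`

# URL of the page being parsed (response.url in the spider)
base_url = "https://www.flashback.org/t2637903"

# Relative href extracted from the page, as seen in the error message
next_link = "/t2637903p2"

# urljoin resolves the relative path against the base URL,
# keeping the scheme and host of the base
absolute_url = urljoin(base_url, next_link)
print(absolute_url)  # https://www.flashback.org/t2637903p2
```

If the extracted link is already absolute, urljoin() returns it unchanged, so it is safe to apply to every extracted href.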

Also note that there is no need to instantiate a Selector object; you can use the response.xpath() shortcut directly:

def parse(self, response): 
    next_link = response.xpath('//a[@class="smallfont2"]//@href').extract()[0] 
    # ...

Source

2015-10-23 00:31:32 alecxe

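Hi, thank you very much for your answer. I had seen similar solutions using "urljoin" before, but I don't understand how to use it in my code. Where exactly does it go? – Jomasdf

@Jomasdf OK, use it when you make the request: 'yield self.make_requests_from_url(urljoin(response.url, next_link))'. – alecxe

Ah, I see. Thank you very much! – Jomasdf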
