2015-10-22 228 views

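Absolute vs. relative paths with Scrapy

I'm trying to scrape a forum, ultimately to find the posts that contain links. For now I'm only trying to scrape the username of each post, but I think the problem is that the URLs are not static.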

spider.py 

from scrapy.spiders import CrawlSpider 
from scrapy.selector import Selector 
from scrapy.item import Item, Field 


class TextPostItem(Item): 
    title = Field() 
    url = Field() 
    submitted = Field() 


class RedditCrawler(CrawlSpider): 
    name = 'post-spider' 
    allowed_domains = ['flashback.org'] 
    start_urls = ['https://www.flashback.org/t2637903'] 


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="smallfont2"]//@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = Selector(response).xpath('//div[@id="posts"]/div[@class="alignc.p4.post"]')
        for post in posts:
            i = TextPostItem()
            i['title'] = post.xpath('tbody/tr[1]/td/span/text()').extract()[0]
            #i['url'] = post.xpath('div[2]/ul/li[1]/a/@href').extract()[0]
            yield i

This gives me the following error:

raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: /t2637903p2 

Any ideas?

Answers

The link you extracted is a relative URL, so you need to "join" it with response.url using urljoin():

from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

urljoin(response.url, next_link) 
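
To make the effect of the join concrete, here is a minimal standalone sketch that uses only the standard library. The two URLs are taken from the question and its error message; the variable names page_url and absolute are illustrative, not part of the spider.

```python
from urllib.parse import urljoin

# URL of the page being parsed (response.url in the spider).
page_url = "https://www.flashback.org/t2637903"
# Relative link extracted from the page, as shown in the error message.
next_link = "/t2637903p2"

# urljoin resolves the relative link against the page URL,
# supplying the scheme and host that the relative link lacks.
absolute = urljoin(page_url, next_link)
print(absolute)  # https://www.flashback.org/t2637903p2
```

If the extracted link is already absolute, urljoin returns it unchanged, so the call is safe in both cases.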

Also note that there is no need to instantiate a Selector object; you can use the response.xpath() shortcut directly:

def parse(self, response): 
    next_link = response.xpath('//a[@class="smallfont2"]//@href').extract()[0] 
    # ... 
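
As a rough sketch of how the pieces could fit together, the helper below wraps the join and also guards against the case where no "next" link is found (where indexing [0] on an empty extract() result would raise IndexError). The function name next_page_url is hypothetical, and the commented-out usage assumes the same spider and XPath as in the question.

```python
from urllib.parse import urljoin


def next_page_url(page_url, hrefs):
    """Return the absolute URL of the next page, or None if no link was found.

    page_url: URL of the page being parsed (response.url in a spider).
    hrefs: list of href strings, e.g. the result of
           response.xpath('//a[@class="smallfont2"]//@href').extract().
    (Hypothetical helper, written for illustration only.)
    """
    if not hrefs:
        # No matching link on this page, so there is nothing to follow.
        return None
    return urljoin(page_url, hrefs[0])


# Assumed use inside parse() (not executed here):
#     hrefs = response.xpath('//a[@class="smallfont2"]//@href').extract()
#     url = next_page_url(response.url, hrefs)
#     if url:
#         yield self.make_requests_from_url(url)
```

Scrapy responses also provide response.urljoin(next_link), which performs the same join against the response URL and may be more convenient inside a spider.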

Hi, thanks a lot for your answer. I've seen similar solutions using "urljoin" before, but I don't understand how to use it in my code. Where exactly does it go? – Jomasdf

@Jomasdf Sure, use it when you make the request: 'yield self.make_requests_from_url(urljoin(response.url, next_link))'. – alecxe

Ah, I see. Thank you very much! – Jomasdf