2016-09-30 92 views

Scrapy - encoding issue - scraping out quotes

I have this class:

import scrapy

# PitchforkItem is the project's Item subclass with 'artist' and 'track'
# fields; it is assumed to be imported from the project's items module.


class PitchforkTracks(scrapy.Spider):
    name = "pitchfork_tracks"
    allowed_domains = ["pitchfork.com"]
    start_urls = [
        "http://pitchfork.com/reviews/best/tracks/?page=1",
        "http://pitchfork.com/reviews/best/tracks/?page=2",
        "http://pitchfork.com/reviews/best/tracks/?page=3",
        "http://pitchfork.com/reviews/best/tracks/?page=4",
        "http://pitchfork.com/reviews/best/tracks/?page=5",
    ]

    def parse(self, response):
        # One item per track row: artist name and track title.
        for sel in response.xpath('//div[@class="track-details"]/div[@class="row"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('.//li/text()').extract_first()
            item['track'] = sel.xpath('.//h2[@class="title"]/text()').extract_first()
            yield item

Scraping this element:

<h2 class="title" data-reactid="...">“Colours”</h2>

gives the right result, but it is printed like this:

{'artist': u'The Avalanches', 'track': u'\u201cColours\u201d'} 

Where and how should I strip the quotes, i.e. \u201c and \u201d?


Have you tried http://stackoverflow.com/questions/15321138/removing-unicode-u2026-like-characters-in-a-string-in-python2-7? – Ben


@Ben If I write `item['track'] = item['track'].decode('unicode_escape').encode('ascii', 'ignore')`, I get this traceback: `UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)`

Answers


Inside parse(self, response), strip the curly quotes when assigning the track:

item['track'] = sel.xpath('.//h2[@class="title"]/text()').extract_first().strip(u'\u201c\u201d')
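
One caveat with the one-liner above: extract_first() returns None when the XPath matches nothing, and calling .strip() on None raises AttributeError. A minimal sketch of a guard follows. The helper name clean_title and the constant QUOTE_CHARS are made up for illustration and are not part of the original spider or of Scrapy.

```python
# Hypothetical helper; the names below are illustrative assumptions.
QUOTE_CHARS = u'\u201c\u201d'  # left and right curly double quotes


def clean_title(raw):
    """Strip surrounding whitespace and curly quotes; pass None through."""
    if raw is None:
        return None
    # str.strip(chars) removes any of the given characters from both ends
    # only, so quotes inside the title are left untouched.
    return raw.strip().strip(QUOTE_CHARS)


print(clean_title(u'\u201cColours\u201d'))  # Colours
```

In the spider this would be used as item['track'] = clean_title(sel.xpath('.//h2[@class="title"]/text()').extract_first()).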