Scrapy - encoding issue - scraping quotes

I have this class:
import scrapy

# PitchforkItem is the project's Item class (its import is not shown in the question).


class PitchforkTracks(scrapy.Spider):
    name = "pitchfork_tracks"
    allowed_domains = ["pitchfork.com"]
    start_urls = [
        "http://pitchfork.com/reviews/best/tracks/?page=1",
        "http://pitchfork.com/reviews/best/tracks/?page=2",
        "http://pitchfork.com/reviews/best/tracks/?page=3",
        "http://pitchfork.com/reviews/best/tracks/?page=4",
        "http://pitchfork.com/reviews/best/tracks/?page=5",
    ]

    def parse(self, response):
        # One row per track; pull the artist and the track title out of each.
        for sel in response.xpath('//div[@class="track-details"]/div[@class="row"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('.//li/text()').extract_first()
            item['track'] = sel.xpath('.//h2[@class="title"]/text()').extract_first()
            yield item
Scraping this item:
<h2 class="title" data-reactid="...>“Colours”</h2>
The result, however, prints like this:
{'artist': u'The Avalanches', 'track': u'\u201cColours\u201d'}
Where and how do I remove the quotes, i.e. \u201c and \u201d?
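One possible way to do this, as a minimal sketch rather than a definitive fix: strip the two characters from the extracted string before storing it on the item. `strip()` with a character set removes them only at the ends of the string, so quotes inside a title are left alone. The helper name `strip_curly_quotes` is hypothetical, not part of Scrapy; it passes `None` through because `extract_first()` returns `None` when nothing matches.

```python
# Left and right curly double quotes, written as escapes so the
# source file stays pure ASCII.
CURLY_QUOTES = u'\u201c\u201d'


def strip_curly_quotes(text):
    """Remove curly double quotes from both ends of text.

    Hypothetical helper for illustration; None is returned unchanged.
    """
    if text is None:
        return None
    return text.strip(CURLY_QUOTES)


# The value from the question, before and after cleaning.
track = strip_curly_quotes(u'\u201cColours\u201d')
print(track)
```

In `parse`, this would be applied as `item['track'] = strip_curly_quotes(sel.xpath('.//h2[@class="title"]/text()').extract_first())`.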
Have you tried http://stackoverflow.com/questions/15321138/removing-unicode-u2026-like-characters-in-a-string-in-python2-7? – Ben
@Ben If I write: `item['track'] = item['track'].decode('unicode_escape').encode('ascii', 'ignore')` I get this traceback: `UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)`
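That traceback is expected on Python 2: `item['track']` is already a `unicode` object, so calling `.decode()` on it makes Python first encode it implicitly with the ASCII codec, and that step fails on `\u201c`. Below is a sketch of the same idea without the `.decode('unicode_escape')` call, assuming it is acceptable to lose every non-ASCII character (accented letters in names are dropped as well):

```python
# The value as extracted by the spider (already a text/unicode string).
raw = u'\u201cColours\u201d'

# Encode straight to ASCII, silently dropping anything that does not fit,
# then decode back so the result is text again on both Python 2 and 3.
ascii_only = raw.encode('ascii', 'ignore').decode('ascii')

# Caveat: the same call also removes legitimate non-ASCII letters.
accented = u'Beyonc\u00e9'.encode('ascii', 'ignore').decode('ascii')
```

Because of that caveat, removing only the two quote characters is likely the safer choice for artist and track names.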