2016-04-08

Python, UTF-8 URI decoding

I'm a Python newbie (of course) and I'm having some problems decoding a URI-encoded string.

My code:

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
# Encoding: UTF-8 
... 
import urllib 
... 
secondTag = urllib.unquote(secondTag).decode('utf8') 
... 

secondTag = "flashvars=%7B%22video%22%3A%7B%22videoReferences%22%3A%5B%7B%22url%22%3A%22http%3A%2F%2Fsvtplay6a-f.akamaihd.net%2Fz%2Fse%2Fopen%2F20160405%2F1114066-002A%2FPG-1114066-002A-AFFARENRAMEL-01_%2C988%2C240%2C348%2C456%2C636%2C1680%2C2796%2C.mp4.csmil%2Fmanifest.f4m%22%2C%22playerType%22%3A%22flash%22%7D%5D%2C%22subtitleReferences%22%3A%5B%7B%22url%22%3A%22http%3A%2F%2Fmedia.svt.se%2Fdownload%2Fmcc%2Ftest%2Fcore-prd%2FSUB-1114066-002A-AFFARENRAMEL%2FSUB-1114066-002A-AFFARENRAMEL.wsrt%22%7D%5D%2C%22position%22%3A0%7D%2C%22statistics%22%3A%7B%22client%22%3A%22nojs%22%2C%22mmsClientNr%22%3A%221001001%22%2C%22programId%22%3A%221114066-002A%22%2C%22statisticsUrl%22%3A%22%2F%2Fld.svt.se%2Fsvt%2Fsvt%2Fs%3Fnojs.Aff%C3%A4ren%20Ramel.Avsnitt%202%22%2C%22title%22%3A%22Avsnitt%202%22%2C%22folderStructure%22%3A%22Aff%C3%A4ren%20Ramel%22%7D%2C%22context%22%3A%7B%7D%7D" 

The result is:

File "/home/mythtv/bin/pyPirateDownloader/svtPlay.py", line 70, in checkSecondSvtPage 
secondTag = urllib.unquote(secondTag).decode('utf8') 
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode 
return codecs.utf_8_decode(input, errors, True) 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 509-510: ordinal not in range(128) 

However, when I run the same thing in the Python console, I get the expected result:

>>> import urllib 
>>> secondTag = "flashvars=%7B%22video%22%3A%7B%22videoReferences%22%3A%5B%7B%22url%22%3A%22http%3A%2F%2Fsvtplay6a-f.akamaihd.net%2Fz%2Fse%2Fopen%2F20160405%2F1114066-002A%2FPG-1114066-002A-AFFARENRAMEL-01_%2C988%2C240%2C348%2C456%2C636%2C1680%2C2796%2C.mp4.csmil%2Fmanifest.f4m%22%2C%22playerType%22%3A%22flash%22%7D%5D%2C%22subtitleReferences%22%3A%5B%7B%22url%22%3A%22http%3A%2F%2Fmedia.svt.se%2Fdownload%2Fmcc%2Ftest%2Fcore-prd%2FSUB-1114066-002A-AFFARENRAMEL%2FSUB-1114066-002A-AFFARENRAMEL.wsrt%22%7D%5D%2C%22position%22%3A0%7D%2C%22statistics%22%3A%7B%22client%22%3A%22nojs%22%2C%22mmsClientNr%22%3A%221001001%22%2C%22programId%22%3A%221114066-002A%22%2C%22statisticsUrl%22%3A%22%2F%2Fld.svt.se%2Fsvt%2Fsvt%2Fs%3Fnojs.Aff%C3%A4ren%20Ramel.Avsnitt%202%22%2C%22title%22%3A%22Avsnitt%202%22%2C%22folderStructure%22%3A%22Aff%C3%A4ren%20Ramel%22%7D%2C%22context%22%3A%7B%7D%7D" 
>>> secondTag = urllib.unquote(secondTag).decode('utf8') 
>>> print secondTag 
flashvars={"video":{"videoReferences":[{"url":"http://svtplay6a-f.akamaihd.net/z/se/open/20160405/1114066-002A/PG-1114066-002A-AFFARENRAMEL-01_,988,240,348,456,636,1680,2796,.mp4.csmil/manifest.f4m","playerType":"flash"}],"subtitleReferences":[{"url":"http://media.svt.se/download/mcc/test/core-prd/SUB-1114066-002A-AFFARENRAMEL/SUB-1114066-002A-AFFARENRAMEL.wsrt"}],"position":0},"statistics":{"client":"nojs","mmsClientNr":"1001001","programId":"1114066-002A","statisticsUrl":"//ld.svt.se/svt/svt/s?nojs.Affären Ramel.Avsnitt 2","title":"Avsnitt 2","folderStructure":"Affären Ramel"},"context":{}} 

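Of course this is some kind of encoding problem, and I guess it has to do with the 'ä' characters, because the problem doesn't occur when there are no Swedish characters. I just can't figure out why it happens or how to fix it.

Can anybody explain, and perhaps help?

Thanks /Jon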

Answer

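Note that this is a UnicodeEncodeError: it is not failing to decode the string, it is failing to encode it!

Because Python 2 converts automatically between str and unicode, you can get an encode error if you try to decode a unicode string.

That is probably what is happening here. I presume that in the file, secondTag is a unicode object. urllib.unquote then returns a unicode object as well, so when you call decode on it, Python first tries to encode it to a str object (which is what can actually be decoded), using the default ascii encoding, and that encoding step fails.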
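To make the hidden step visible, here is a minimal sketch that performs the ascii encode explicitly. It assumes that Python 2's urllib.unquote, given a unicode input, turns each escaped byte of %C3%A4 into its own code point (U+00C3 and U+00A4), which would match the two-character range in the traceback. The sample value `unquoted` is made up for illustration and is much shorter than the real string.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for what unquote returns for a unicode input:
# the two bytes of UTF-8 'ä' end up as two separate non-ASCII characters.
unquoted = u'Aff\xc3\xa4ren'

try:
    # Python 2 does this encode implicitly when .decode() is called
    # on a unicode object; here it is spelled out.
    unquoted.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The message names the 'ascii' codec and the offending character range.
print(error)
```

The error is raised by the encode call, before any UTF-8 decoding is attempted, which is why the traceback reports an encode failure on a line that only asks for a decode.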

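There is no especially elegant way to deal with this. Probably the most elegant is urllib.unquote(secondTag.encode('utf8')).decode('utf8'). If you also want to handle the case where secondTag is already a str, you can easily extend the encode part to secondTag.encode('utf8') if isinstance(secondTag, unicode) else secondTag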
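Wrapped up as a function, the approach above might look like the sketch below. The helper name `unquote_utf8` and the alias `unquote_bytes` are made up for this example. The import fallback is an addition so that the same snippet also runs on Python 3, where urllib.parse.unquote_to_bytes plays the role of the byte-oriented unquote; on Python 2 only the first branch is used.

```python
# -*- coding: utf-8 -*-
try:
    # Python 2: urllib.unquote on a byte string returns a byte string.
    from urllib import unquote as unquote_bytes
    text_type = unicode
except ImportError:
    # Python 3 fallback (an assumption beyond the original answer).
    from urllib.parse import unquote_to_bytes as unquote_bytes
    text_type = str


def unquote_utf8(value):
    """Percent-decode value and return the result as text, assuming UTF-8."""
    if isinstance(value, text_type):
        # Encode text to bytes first so no implicit ascii conversion happens.
        value = value.encode('utf8')
    return unquote_bytes(value).decode('utf8')


# Works whether the input is text or a byte string.
print(unquote_utf8(u'Aff%C3%A4ren%20Ramel'))
print(unquote_utf8(b'%7B%22title%22%3A%22Avsnitt%202%22%7D'))
```

Doing the type check in one place means the calling code does not need to know whether secondTag arrived as str or unicode.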

Very nice. I tried your first solution and that did it. I really must try to understand this encoding stuff. Thanks mate! /jon – jonsag