2016-04-04 95 views

Unicode strings in urllib.request

Short version: I have a variable s = 'bär'. I need to convert s to ASCII so that s = 'b%C3%A4r'.

Long version:

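I'm using urllib.request.urlopen() to read an mp3 voice file from a URL. This works fine, except that I run into problems because the URLs often contain Unicode characters, for example the German "Bär". The full URL is https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär. Entering this URL in Chrome takes me to the mp3 file without any problem. Feeding the same URL to urllib, however, raises an error.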

I'm sure this is a Unicode problem, because the stack trace reads:

Traceback (most recent call last): 
    File "importer.py", line 145, in <module> 
    download_file(tuple[1], tuple[0], ".mp3") 
    File "importer.py", line 81, in download_file 
    with urllib.request.urlopen(url) as in_stream, open(to_fname+ext, 'wb') as out_file: #`with object as name:` safely __enter__() and __exit__() the runtime of object. `as` assigns `name` as referring to the object `object`. 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen 
    return opener.open(url, data, timeout) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open 
    response = self._open(req, data) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open 
    '_open', req) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain 
    result = func(*args) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1283, in https_open 
    context=self._context, check_hostname=self._check_hostname) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open 
    h.request(req.get_method(), req.selector, req.data, headers) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request 
    self._send_request(method, url, body, headers) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1118, in _send_request 
    self.putrequest(method, url, **skips) 
    File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 960, in putrequest 
    self._output(request.encode('ascii')) 
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 19: ordinal not in range(128) 

...and, apart from the rather obvious UnicodeEncodeError, I can see that it is trying to encode() the request to ASCII.

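Interestingly, when I copy the URL from Chrome (instead of simply typing it into the Python interpreter), it translates bär to b%C3%A4r. When I feed this to urllib.request.urlopen(), it is processed fine, because all of these characters are ASCII. So my goal is to perform this conversion inside my program. I tried to convert my original string to a Unicode equivalent, but unicodedata.normalize() did not work in any of its variants. I'm also not sure how to store Unicode as ASCII, given that Python 3 stores all strings as Unicode and so makes no attempt to convert the text.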

Answers

1

Use urllib.parse.quote:

>>> import urllib.parse
>>> urllib.parse.quote('bär')
'b%C3%A4r'

>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/', 
...      urllib.parse.quote('bär')) 
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r' 
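When only the full URL is at hand rather than the last path segment, one possible approach is to split the URL, quote just the path, and reassemble it. The following is a minimal sketch under that assumption; the helper name `quote_url` is hypothetical and not part of urllib. Quoting the whole URL directly would also escape the `:` after the scheme, which is why the path is handled separately here.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode the path of a URL, leaving scheme and host untouched.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # safe="/%" keeps path separators as they are and leaves existing
    # percent-escapes alone, so an already-quoted URL is not quoted twice.
    quoted_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=quoted_path))


url = quote_url("https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär")
print(url)
```

The resulting string contains only ASCII characters, so it can be passed to `urllib.request.urlopen(url)` in the same way as the URL copied from Chrome. This sketch does not touch the query string or a non-ASCII host name, which would need separate handling.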
+0

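Is there a difference between simply concatenating the strings and using `urljoin()`? Also, is there a name for this type of Unicode? Given that the Unicode I get from `normalize()` is completely different, I'd like to know what to call each of them when discussing this. –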

+1

In your case, using `urljoin` is not strictly required. But consider this: `urllib.parse.urljoin('http://example.com/a/b/c', '/x/y/z')` – falsetru
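To make the difference in that comment concrete, here is a small sketch comparing `urljoin` with plain concatenation, using the same example.com URLs. `urljoin` resolves the second argument the way a browser resolves a link relative to the base, while `+` just glues the strings together.

```python
from urllib.parse import urljoin

base = "http://example.com/a/b/c"

# A reference starting with "/" replaces the whole path of the base.
absolute_path = urljoin(base, "/x/y/z")

# A relative reference replaces only the last segment ("c") of the base.
relative = urljoin(base, "x")

# With a trailing slash on the base, the reference is appended instead.
appended = urljoin(base + "/", "x")

# Plain concatenation knows nothing about URL structure.
concatenated = base + "/x/y/z"

print(absolute_path)  # http://example.com/x/y/z
print(relative)       # http://example.com/a/b/x
print(appended)       # http://example.com/a/b/c/x
print(concatenated)   # http://example.com/a/b/c/x/y/z
```

In the question's case the base ends with a slash and the quoted word contains no slash, so both approaches should give the same URL.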

+1

It's not a kind of Unicode. I've heard it called [percent-encoding](https://en.wikipedia.org/wiki/Percent-encoding). – falsetru
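As a rough illustration of what percent-encoding does, and why `unicodedata.normalize()` gives a different result, the sketch below encodes the string to UTF-8 and writes each non-ASCII byte as `%XX`. The manual loop is a simplification for this one word: unlike `urllib.parse.quote`, it does not escape reserved ASCII characters such as spaces or `?`.

```python
import unicodedata
from urllib.parse import quote, unquote

s = "b\u00e4r"  # 'bär' with a precomposed ä

# Percent-encoding works on bytes: ä becomes the two UTF-8 bytes C3 A4.
utf8_bytes = s.encode("utf-8")
manual = "".join(
    chr(b) if b < 0x80 else "%{:02X}".format(b) for b in utf8_bytes
)
print(manual)                # b%C3%A4r
print(quote(s) == manual)    # True
print(unquote(manual) == s)  # True

# Normalization only changes how the characters are composed; the
# result is still a Unicode string containing non-ASCII code points.
nfd = unicodedata.normalize("NFD", s)
print(len(s), len(nfd))               # 3 4
print(any(ord(c) > 127 for c in nfd)) # True
```

In NFD form the ä is split into `a` plus a combining diaeresis (U+0308), which still cannot be encoded as ASCII, so normalization alone does not produce a string that `http.client` will accept.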