Unicode string in urllib.request
Short version: I have a variable s = 'bär'. I need to convert s to ASCII so that s = 'b%C3%A4r'.
Long version:
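I use urllib.request.urlopen() to read an mp3 speech file from a URL. This works fine, except that I run into problems because the URLs often contain Unicode characters. For example, the German word "Bär". The full URL is https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär. In fact, typing it into Chrome as a URL navigates me to the mp3 file without any problem. However, feeding that same URL to urllib causes problems.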
I'm sure this is a Unicode problem because the stack trace reads:
Traceback (most recent call last):
  File "importer.py", line 145, in <module>
    download_file(tuple[1], tuple[0], ".mp3")
  File "importer.py", line 81, in download_file
    with urllib.request.urlopen(url) as in_stream, open(to_fname+ext, 'wb') as out_file: #`with object as name:` safely __enter__() and __exit__() the runtime of object. `as` assigns `name` as referring to the object `object`.
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1283, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 19: ordinal not in range(128)
...and besides the obvious UnicodeEncodeError, I can see that it is trying to encode() to ASCII.
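For what it's worth, the failing step can probably be reproduced without touching the network. This is only a sketch: `request_line` is a made-up, simplified stand-in for what http.client builds internally, not the exact string it sends.

```python
# Hypothetical, simplified request line containing the raw (unencoded) path.
request_line = "GET /tts/de/token/bär HTTP/1.1"

try:
    # http.client does the equivalent of this in putrequest().
    request_line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which character could not be encoded.
print(error)
```

Any character outside the 7-bit ASCII range triggers the same exception, regardless of which word is in the path.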
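Interestingly, when I copy the URL out of Chrome (rather than simply typing it into the Python interpreter), it translates bär to b%C3%A4r. When I feed that to urllib.request.urlopen(), it is handled fine, because all of those characters are ASCII. So my goal is to do this conversion inside my program. I tried to get my original string into an equivalent Unicode form, but unicodedata.normalize() in all its variants did not work; besides, I'm not sure how to store Unicode as ASCII, since Python 3 stores all strings as Unicode and so does not try to convert the text.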
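One way this conversion might be done is with urllib.parse.quote() from the standard library, which percent-encodes the UTF-8 bytes of a string. A minimal sketch, assuming only the last path segment contains non-ASCII characters; the names `base`, `word` and `url` are just for illustration:

```python
from urllib.parse import quote

# Base URL taken from the question; only the final segment needs encoding.
base = "https://d7mj4aqfscim2.cloudfront.net/tts/de/token/"
word = "bär"

# quote() encodes the string as UTF-8 and writes each non-ASCII byte as %XX.
encoded_word = quote(word)
url = base + encoded_word

# The result is pure ASCII, so this no longer raises UnicodeEncodeError.
url.encode("ascii")
print(url)
```

The resulting `url` should then be acceptable to urllib.request.urlopen(), since it matches the form Chrome produces when the address is copied.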
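Is there a difference between simply concatenating the strings and using `urljoin()`? Also, is there a name for this type of Unicode? Given that the Unicode I get from `normalize()` is completely different, I'd like to know what to call each of them when discussing this.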
In your case, using `urljoin` is not strictly required. But consider: `urllib.parse.urljoin('http://example.com/a/b/c', '/x/y/z')` – falsetru
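To make the point of that comment concrete, here is a small sketch comparing `urljoin` with plain concatenation. The variable names are illustrative only; the URLs are the ones from the comment.

```python
from urllib.parse import urljoin

base = "http://example.com/a/b/c"

# A reference starting with '/' replaces the whole path of the base URL.
joined_absolute = urljoin(base, "/x/y/z")

# A relative reference replaces only the last segment ('c') of the base path.
joined_relative = urljoin(base, "x/y/z")

# Plain concatenation just glues the two strings together.
concatenated = base + "/x/y/z"

print(joined_absolute)   # http://example.com/x/y/z
print(joined_relative)   # http://example.com/a/b/x/y/z
print(concatenated)      # http://example.com/a/b/c/x/y/z
```

So the two approaches agree only when the base ends in a slash and the second part is a plain relative name, which happens to be the situation in the question.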
That's not Unicode. I've heard it called [percent-encoding](https://en.wikipedia.org/wiki/Percent-encoding). – falsetru
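As a rough illustration of how percent-encoding differs from Unicode normalization, the sketch below builds the encoded form by hand from the UTF-8 bytes and compares it with the standard library. The hand-rolled version is deliberately simplified (it leaves every ASCII byte untouched, which the real `quote()` does not do for reserved characters), and the variable names are made up for the example.

```python
import unicodedata
from urllib.parse import quote, unquote

word = "b\u00e4r"  # 'bär' written with a precomposed ä (U+00E4)

# Percent-encoding works on bytes: encode to UTF-8, then write each
# non-ASCII byte as '%' followed by two hex digits.
utf8_bytes = word.encode("utf-8")
by_hand = "".join(
    chr(b) if b < 128 else "%{:02X}".format(b)
    for b in utf8_bytes
)

# normalize() only changes which code points represent the text; NFD splits
# 'ä' into 'a' plus a combining diaeresis. The result is still non-ASCII.
decomposed = unicodedata.normalize("NFD", word)

print(by_hand)             # b%C3%A4r
print(quote(decomposed))   # ba%CC%88r
print(unquote(by_hand))    # bär
```

This may explain why `normalize()` looked "completely different": it produces another Unicode spelling of the same text, whereas percent-encoding produces an ASCII-only representation of its bytes.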