How to pass a Unicode string as a parameter to urllib.urlencode()

I am using Microsoft's free translation service to translate some Hindi text into English. They do not provide a Python API, but I borrowed code from: tinyurl.com/dxh6thr

I am trying to use the 'Detect' method as described here: tinyurl.com/bxkt3we

The 'hindi.txt' file is saved in the Unicode character set.

>>> hindi_string = open('hindi.txt').read() 
>>> data = { 'text' : hindi_string } 
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN) 
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data)) 
>>> request.add_header('Authorization', 'Bearer '+token) 
>>> response = urllib2.urlopen(request) 
>>> print response.read() 
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string> 
>>>

The response shows that the translator detected 'en' instead of 'hi' (for Hindi). When I check the type of the string, it shows as 'str':

>>> type(hindi_string) 
<type 'str'>

For reference, here is the content of 'hindi.txt':

हाय, कैसे आप आज कर रहे हैं। मैं अच्छी तरह से, आपको धन्यवाद कर रहा हूँ।

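I am not sure whether string.encode or string.decode applies here. If it does, what do I need to encode or decode, and to or from which encoding? What is the best way to pass a Unicode string as a urllib.urlencode argument? How can I make sure the actual Hindi characters are passed as the parameter?

Thank you.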

**Additional information**

I tried using codecs.open() as suggested, but I get the following error:

>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read() 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "C:\Python27\lib\codecs.py", line 671, in read 
    return self.reader.read(size) 
    File "C:\Python27\lib\codecs.py", line 477, in read 
    newchars, decodedbytes = self.decode(data, self.errors) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is the repr(hindi_string) output:

>>> repr(hindi_string) 
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00 
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \ 
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"
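
The first two bytes in the repr output above, '\xff\xfe', match the UTF-16 little-endian byte order mark (BOM), which is also why the 'utf8' codec fails on byte 0xff at position 0. The following is a minimal sketch of checking raw bytes for a BOM with the constants in the standard codecs module. The helper name sniff_bom and the short sample bytes are illustrative assumptions, not part of the original code.

```python
import codecs


def sniff_bom(raw_bytes):
    """Return a codec name suggested by a leading BOM, or None if there is none.

    sniff_bom is a hypothetical helper written for this illustration.
    """
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    candidates = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]
    for bom, codec_name in candidates:
        if raw_bytes.startswith(bom):
            return codec_name
    return None


# The first eight bytes from the repr shown in the question.
head = b'\xff\xfe9\t>\t/\t'
guess = sniff_bom(head)
decoded = head.decode(guess)
print(guess)
```

A BOM is only a hint: files without one still need their encoding to be known from some other source.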

Source

2012-11-02 Logic Al

What encoding did you save the file in? Have you tried using 'codecs.open' instead of plain 'open' to get the file contents decoded correctly? – Bakuriu

You show 'hindi_string' being defined, but not 'hindi'. Please show 'repr(hindi)'. – eryksun

Read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html). – katrielalex

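Your file is UTF-16 (the leading '\xff\xfe' in the repr output is the UTF-16 little-endian byte order mark), so you need to decode the content and then re-encode it as UTF-8 before sending it: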

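# Read the raw bytes ('rb' avoids newline translation on Windows), then decode from UTF-16
hindi_string = open('hindi.txt', 'rb').read().decode('utf-16') 
# Pass urlencode a UTF-8 byte string rather than a unicode object
data = { 'text' : hindi_string.encode('utf-8') } 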
...
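
The decode-then-encode step above can be sketched end to end without touching the translation service. This is a minimal illustration, not the original poster's code: the function name build_detect_query and the three-character sample are assumptions, and the import fallback is there only so that the same sketch runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs

try:
    from urllib import urlencode        # Python 2, as used in the question
except ImportError:
    from urllib.parse import urlencode  # Python 3 location of the same function


def build_detect_query(raw_bytes):
    """Turn UTF-16 file bytes into a percent-encoded UTF-8 query string.

    build_detect_query is a hypothetical helper for this illustration.
    """
    text = raw_bytes.decode('utf-16')                  # bytes -> unicode text
    return urlencode({'text': text.encode('utf-8')})   # unicode -> UTF-8 bytes


# Stand-in for open('hindi.txt', 'rb').read(): a BOM followed by the
# first three characters of the sample file ("हाय").
sample = u'\u0939\u093e\u092f'
raw = codecs.BOM_UTF16_LE + sample.encode('utf-16-le')

query = build_detect_query(raw)
print(query)  # text=%E0%A4%B9%E0%A4%BE%E0%A4%AF
```

Each Devanagari character becomes three percent-encoded bytes because it takes three bytes in UTF-8.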

Source

2012-11-02 21:18:40 mata

Thank you very much, sir! This works perfectly :) –

You could try opening the file with codecs.open and decoding it as UTF-8:

import codecs 

with codecs.open('hindi.txt', encoding='utf-8') as f: 
    hindi_text = f.read()
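
Since the file in the question begins with a UTF-16 byte order mark, the same approach probably needs encoding='utf-16' rather than 'utf-8'. The sketch below shows this under a few assumptions: it uses io.open (which accepts the same encoding argument and is available on Python 2.6+ and Python 3) in place of codecs.open, the helper name read_text is hypothetical, and a temporary file stands in for 'hindi.txt'.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path, encoding='utf-16'):
    """Read a whole text file and return it as unicode text.

    read_text is a hypothetical helper for this illustration.
    """
    with io.open(path, encoding=encoding) as f:
        return f.read()


# Create a small UTF-16 file to stand in for 'hindi.txt'.
sample = u'\u0939\u093e\u092f'  # "हाय"
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-16') as f:
    f.write(sample)  # the utf-16 codec writes a BOM first

try:
    hindi_text = read_text(path)            # BOM is consumed while decoding
    utf8_bytes = hindi_text.encode('utf-8')  # ready to hand to urlencode
finally:
    os.remove(path)

print(repr(hindi_text))
```

The value read this way is unicode text, so it still has to be encoded to UTF-8 before being passed to urllib.urlencode on Python 2.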

Source

2012-11-02 20:42:01
