Encoding special characters

I have a piece of code that works well in Python 3:

def encode_test(filepath, char_to_int):
    with open(filepath, "r", encoding="latin-1") as f:
        dat = [line.rstrip() for line in f]
        string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]

However, when I tried to run it in Python 2.7, I first got this error:

SyntaxError: Non-ASCII character '\xc3' in file languageIdentification.py on line 30, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Then I realized I probably needed to add # coding=utf-8 at the top of the code. But after doing that, I ran into another error:

UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal 
string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat] 
Traceback (most recent call last): 
File "languageIdentification.py", line 190, in <module> 
test_string = encode_test(sys.argv[3], char_to_int) 
File "languageIdentification.py", line 32, in encode_test 
string_to_int = [[char_to_int[char] if char != 'ó' else 
char_to_int['ò'] for char in line] for line in dat] 
KeyError: u'\xf3'

So can anyone tell me what I can do to fix this in Python 2.7?

Thanks!

Source

2017-10-20 Parker

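Python 3 'str' objects are actually equivalent to Python 2 'unicode' objects, and Python 2 'str' objects are equivalent to Python 3 'bytes'. Just convert *everything* to unicode objects in your source code and work with those. –

@juanpa.arrivillaga Actually, I can't make changes to the source file. Is there any way I can handle this directly in the program? – Parker

What? You mean in your text file? You have to change your code. You can't expect to reuse Python 3 code in Python 2 when the nature of the 'str' type has changed fundamentally. –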

The problem is that you are trying to compare a unicode string with a byte string:

char != 'ó'

where char is unicode and 'ó' is a byte string (or just str).

When Python 2 encounters such a comparison, it tries to convert (decode) one side:

byte-string -> unicode

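The conversion uses the default encoding, which is ASCII in Python 2.
Since the byte values of 'ó' are above 127, the decoding fails (hence the UnicodeWarning).

By the way, for literals whose byte values are in the ASCII range, the comparison succeeds.
Example: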

print u'ó' == 'ó' # UnicodeWarning: ... 
print u'z' == 'z' # True
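
The implicit decoding step can also be reproduced by hand. The following is only a sketch, written for Python 3 (where bytes and text are never mixed implicitly) and assuming the literal was saved as UTF-8; the variable names are illustrative.

```python
# Python 3 sketch: perform explicitly the ASCII decode that Python 2 attempts
# implicitly when a byte string is compared with a unicode string.
raw = '\u00f3'.encode('utf-8')    # the bytes a UTF-8 source file holds for 'ó'

try:
    raw.decode('ascii')           # what Python 2 tries behind the scenes
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False              # bytes above 127 are not valid ASCII

# Pure-ASCII bytes decode without trouble, so that comparison succeeds.
z_ok = b'z'.decode('ascii') == 'z'

print(ascii_ok, z_ok)             # False True
```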

So before comparing, you need to convert your byte string to unicode manually.
For example, you can do this with the built-in unicode() function:

u = unicode('ó', 'utf-8') # note, that you can specify encoding

or simply with a 'u' literal:

u = u'ó'

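But note: with this option, the conversion is done using the encoding you declared at the top of the source file.
So your actual source encoding and the declared encoding must match.

As I can see from the SyntaxError message, 'ó' in your source starts with the '\xc3' byte.
So it should be '\xc3\xb3', which is UTF-8: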

print '\xc3\xb3'.decode('utf-8') # ó

So # coding: utf-8 together with char != u'ó' should solve your problem.
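
To see why the declared encoding has to match the bytes actually stored in the file, here is a small sketch in Python 3. It assumes the two bytes '\xc3\xb3' discussed above and simply decodes them under two different encodings; the variable names are illustrative.

```python
# Python 3 sketch: the same two bytes read under two different encodings.
raw = b'\xc3\xb3'                      # bytes stored in a UTF-8 file for 'ó'

as_utf8 = raw.decode('utf-8')          # matching declaration: one character
as_latin1 = raw.decode('latin-1')      # mismatched declaration: two characters

print(len(as_utf8), len(as_latin1))    # 1 2
print(as_utf8 == '\u00f3')             # True
```

With a mismatched declaration, a u'ó' literal would silently become a two-character string and never compare equal to the single character read from the file.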

UPD。

As I can see from the traceback that follows the UnicodeWarning, there is a second problem: KeyError

This error occurs in the expression:

char_to_int[char]

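because u'\xf3' (which is actually u'ó') is not a valid key.

This unicode value comes from decoding your file (with latin-1).
I guess there are no unicode keys in your char_to_int at all.

So try encoding such a key back to its byte value: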

char_to_int[char.encode('latin-1')]

To sum up, try changing the last line of the code you provided to:

string_to_int = [[char_to_int[char.encode('latin-1')] if char != u'ó' else char_to_int['ò'] for char in line] for line in dat]
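
The key mismatch behind the KeyError can be illustrated in isolation. This is a sketch in Python 3, not the asker's actual code: the char_to_int contents and the lookup helper are hypothetical, and the dict is keyed by latin-1 bytes to stand in for a Python 2 dict built from plain str keys.

```python
# Python 3 sketch. char_to_int is a hypothetical stand-in, keyed by latin-1
# bytes the way a Python 2 dict built from plain str literals would be.
char_to_int = {b'a': 0, b'\xf2': 1}       # b'\xf2' is 'ò' in latin-1

def lookup(char, table):
    """Map one decoded character to its integer, folding 'ó' into 'ò'."""
    if char == '\u00f3':                  # 'ó'
        char = '\u00f2'                   # 'ò'
    return table[char.encode('latin-1')]  # back to the byte used as the key

line = b'a\xf3a'.decode('latin-1')        # text as read from a latin-1 file
print('\u00f2' in char_to_int)            # False: text keys are not byte keys
print([lookup(c, char_to_int) for c in line])   # [0, 1, 0]
```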

Source

2017-10-20 04:43:20 MaximTitarenko

Thanks. This works – Parker

If you want to convert characters to their integer values, you can use the ord function, which also works for Unicode.

line = u'some Unicode line with ò and ó'
string_to_int = [ord(char) if char != u'ó' else ord(u'ò') for char in line]

Source

2017-10-20 06:34:25
