Which encoding should I use to read Italian text in Python?

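I'm using Python Tools for Visual Studio and reading some files written in Italian. I have tried iso-8859-1, iso-8859-2, utf-8, and utf-8-sig. Notepad++ opens the file as UTF-8 without BOM. Which encoding should I use to read Italian text in Python?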

content = fp.read() 
words = content.decode("utf-8-sig").lower().split() 
for w in words: 
    p='' 
    cur.execute('SELECT word FROM multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w)

The string that causes the crash is C'è. (The entry is read as "c\'\xe3\xa8".)
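A minimal sketch of where that garbled value can come from, assuming the file really is UTF-8 as Notepad++ reports. In UTF-8, è is stored as the two bytes C3 A8; decoding them as iso-8859-1 and then calling lower() produces exactly \xe3\xa8. The byte string below is a made-up stand-in for the file content.

```python
# Hypothetical stand-in for fp.read(): "C'è" encoded as UTF-8 bytes.
raw = b"C'\xc3\xa8"

# Decoding with the matching codec gives a single character for è (U+00E8).
good = raw.decode("utf-8").lower()

# Decoding the same bytes as iso-8859-1 gives two characters (U+00C3, U+00A8);
# lower() then turns U+00C3 into U+00E3, which matches the value in the question.
bad = raw.decode("iso-8859-1").lower()

print(repr(good))
print(repr(bad))
```

If the second value is what reaches the query, the text was probably decoded with a Latin codec at some point rather than with utf-8.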

Using chardet did not help either.

Traceback (most recent call last):
  File "C:\Users\Tathagata\Documents\Visual Studio 2012\Projects\PythonApplication4\PythonApplication4\PythonApplication4.py", line 344, in <module>
    createSynsetDict()
  File "C:\Users\Tathagata\Documents\Visual Studio 2012\Projects\PythonApplication4\PythonApplication4\PythonApplication4.py", line 294, in createSynsetDict
    cur.execute('SELECT word FROM multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w)
  File "C:\Python27\lib\site-packages\pymysql\cursors.py", line 117, in execute
    self.errorhandler(self, exc, value)
  File "C:\Python27\lib\site-packages\pymysql\connections.py", line 187, in defaulterrorhandler
    raise Error(errorclass, errorvalue)
Error: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u's\x00\x00\x00\x03SELECT word FROM multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="c\'\xe3\xa8"', 116, 118, 'ordinal not in range(128)'))

Source

2013-04-04 Tathagata

How Do I Stop the Pain? (http://nedbatchelder.com/text/unipain.html) – 2013-04-04 23:53:06

Which DB-API binding are you using? (That is, which database driver?) – 2013-04-04 23:56:40

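...Actually, more to the point, what is the value of the 'paramstyle' global in your database library's module? (If you don't know, just name the module and we can look it up.) – 2013-04-04 23:59:02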

Assuming your database binding uses the format bind-variable style:

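content = fp.read()
# Decode the raw bytes once; everything after this works on unicode objects.
words = content.decode("utf-8-sig").lower().split()
for w in words:
    p = ''
    # %s here is a driver placeholder, not Python string formatting;
    # the value is passed separately as a one-element tuple.
    cur.execute('SELECT word FROM '
                'multiwordnet.italian_lemma l, '
                'multiwordnet.italian_synset s '
                'WHERE l.id = s.id AND l.lemma = %s', (w,))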

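Note that we do not use the % operator between the SQL string and the variable being passed in, and we do not put quotes around the %s. Instead, %s is a placeholder that marks where the value should be substituted into the SQL, and the value is passed as a separate argument. Following this practice not only spares you from dealing with encoding issues (if your parameters are passed as Python unicode strings, the database binding takes it from there), it also prevents SQL injection vulnerabilities.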
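As a rough illustration of that point, here is a sketch that uses the standard-library sqlite3 module instead of PyMySQL so that it runs without a server. sqlite3 uses the qmark style, so the placeholder is ? rather than %s. The table layout and the inserted row are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE italian_lemma (id INTEGER, lemma TEXT)")

# A word containing both a quote and a non-ASCII character (c'è).
word = u"c'\xe8"

# The value travels separately from the SQL text, so neither the quote
# nor the accented letter needs manual escaping or encoding.
cur.execute("INSERT INTO italian_lemma (id, lemma) VALUES (?, ?)", (1, word))
cur.execute("SELECT id FROM italian_lemma WHERE lemma = ?", (word,))
rows = cur.fetchall()
print(rows)
conn.close()
```

Building the same query with the % operator would require escaping the apostrophe by hand, which is exactly the step that parameter binding removes.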

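Other Python database libraries may use different placeholder styles; read the documentation or check the module-level paramstyle constant for yours. (The placeholder for qmark is ?; for numeric it is a colon-prefixed number: :1 for the first parameter, :2 for the second, and so on.)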
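One way to check the style, sketched with sqlite3 because it ships with Python; DB-API 2.0 modules expose the same module-level paramstyle attribute. The PLACEHOLDERS mapping and the placeholder_for helper are illustrative names, not part of any library.

```python
import sqlite3

# First-parameter placeholder for each positional DB-API 2.0 paramstyle.
# (The named and pyformat styles take a mapping and are left out here.)
PLACEHOLDERS = {
    "qmark": "?",
    "numeric": ":1",
    "format": "%s",
}

def placeholder_for(module):
    """Return the first-parameter placeholder for a DB-API module."""
    return PLACEHOLDERS[module.paramstyle]

print(sqlite3.paramstyle)
print(placeholder_for(sqlite3))
```

The same lookup can be pointed at whichever driver module you have imported, as long as it reports one of the three positional styles.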

Source

2013-04-05 00:02:09

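Thanks a lot for the reply. I'm using PyMySQL [https://github.com/petehunt/PyMySQL/], which has 'paramstyle = format'. That is why the code kept working until it reached a word with any funny characters. If I use '?' as you suggested, it throws a KeyError [http://pastebin.com/V7T6xbkY], even for words that '%s' could handle. – Tathagata 2013-04-05 02:35:56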

@Tathagata Yes: with 'format', you should use '%s' rather than '?', but still pass the value after a comma rather than with the '%' operator. I'm updating the answer. – 2013-04-05 13:02:30

@Tathagata ...By the way, please avoid pastebin.com links in the future; for people not using Adblock, the site is full of flashy animated ads. Regarding pastebin: – 2013-04-05 13:06:35
