Python: How do I convert Windows-1251 to Unicode?

I am trying to convert file contents from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it does not work.

#!/usr/bin/env python 

import os 
import sys 
import shutil 

def convert_to_utf8(filename):
    # gather the encodings you think that the file may be
    # encoded inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)

    # now start iterating in our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then the continue
            # keyword will start again with the second encoding
            # from the tuple and so on.... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue

    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)

    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()

What should I do?

Thanks

Source

2011-04-27 Alex

What do you mean by Unicode (http://en.wikipedia.org/wiki/Unicode)? – Gumbo 2011-04-27 16:00:19

@Gumbo, judging by the code, the intended output is UTF-8. – 2011-04-27 16:03:25

import codecs 

f = codecs.open(filename, 'r', 'cp1251') 
u = f.read() # now the contents have been transformed to a Unicode string 
out = codecs.open(output, 'w', 'utf-8') 
out.write(u) # and now the contents have been output as UTF-8

Is this what you intend to do?

Source
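For readers on Python 3, the same idea can be written with the built-in open(), which takes an encoding argument. This is only a minimal sketch and not part of the original answer: the function name recode_cp1251_to_utf8 and its parameters are made up for illustration, and it assumes the input file really is Windows-1251.

```python
def recode_cp1251_to_utf8(src_path, dst_path):
    """Read src_path as Windows-1251 and write the same text to dst_path as UTF-8.

    Hypothetical helper for illustration; assumes the input is really cp1251.
    """
    # Decoding happens while reading: `text` is a str (Unicode) object.
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="cp1251", newline="") as src:
        text = src.read()

    # Encoding happens while writing: the bytes on disk are UTF-8.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

    return text
```

Writing to a separate output path leaves the original file untouched, so a wrong guess about the source encoding does not destroy any data.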

2011-04-27 16:15:26 buruzaemon

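I think you are very close! I managed to read the data from the XML, but when I write it to a file I get strange characters instead of Cyrillic ones. – Alex 2011-04-27 16:26:59

Yes! I got it! I was using cp1252. Thank you very much – Alex 2011-04-27 16:30:52

@Alex, glad to know your code is working. You may want to look at http://www.evanjones.ca/python-utf8.html, which has some good tips. – buruzaemon 2011-04-27 16:38:08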
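The cp1252 versus cp1251 mix-up mentioned in the comments above is easy to reproduce. The following Python 3 sketch uses a sample word chosen purely for illustration; it shows that decoding Windows-1251 bytes with the wrong code page raises no error and simply produces the wrong characters.

```python
# Sample text is an assumption chosen for illustration.
original = "Привет"

# The bytes as they would be stored in a Windows-1251 file.
raw = original.encode("cp1251")

# Decoding with the wrong codec does not fail here; it silently yields
# accented Latin characters instead of Cyrillic ones.
wrong = raw.decode("cp1252")

# Decoding with the right codec restores the text.
right = raw.decode("cp1251")

print(wrong == original)
print(right == original)
```

Because no exception is raised, this kind of mistake usually shows up only when someone looks at the output.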

If you open the file with the codecs module, it converts the contents to Unicode for you as you read from the file. For example:

import codecs 
f = codecs.open('input.txt', encoding='cp1251') 
assert isinstance(f.read(), unicode)

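This only makes sense if you are working with the file's data in Python. If you are trying to convert a file from one encoding to another on the filesystem (which is what the script you posted attempts to do), you have to specify an actual encoding, because you cannot write a file in "Unicode".

Source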
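To make the point about naming an actual encoding concrete, here is a hedged Python 3 sketch of an on-disk conversion that keeps a .bak copy, as the script in the question does. The function name convert_file_in_place and its default arguments are assumptions for illustration, not code from the thread.

```python
import shutil


def convert_file_in_place(path, source_encoding="cp1251", target_encoding="utf-8"):
    """Re-encode `path` in place, keeping a copy of the original at path + '.bak'.

    Hypothetical helper; both encodings must be named explicitly by the caller
    or accepted as the illustrative defaults.
    """
    # Read raw bytes so nothing is decoded implicitly.
    with open(path, "rb") as f:
        raw = f.read()

    # Decode with the explicitly named source encoding. A wrong guess either
    # raises UnicodeDecodeError or silently produces the wrong characters.
    text = raw.decode(source_encoding)

    # Back up the original before overwriting it.
    backup_path = path + ".bak"
    shutil.copy(path, backup_path)

    # Write the bytes in the explicitly named target encoding.
    with open(path, "wb") as f:
        f.write(text.encode(target_encoding))

    return backup_path
```

Decoding before the backup and the overwrite means that a failed decode leaves the file exactly as it was.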

2011-04-27 16:02:53

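I still get an error: UnicodeEncodeError: 'charmap' codec can't encode characters in position: character maps to – Alex 2011-04-27 16:04:39

What is the actual code you are using? Which line triggers this exception? – 2011-04-27 16:08:44

I get the error at f = open(filename, 'r').read() – Alex 2011-04-27 16:11:04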
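The thread does not show which code actually raised this error, so the following is only one plausible cause, sketched in Python 3: a 'charmap' UnicodeEncodeError typically appears when text is encoded into a single-byte code page that lacks the characters. The choice of cp1252 as the failing codec and the sample word are assumptions for illustration.

```python
text = "Привет"  # sample Cyrillic text, chosen for illustration

try:
    # cp1252 has no Cyrillic letters, so this encode call fails.
    text.encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    codec_name = exc.encoding  # name of the codec that could not encode
    reason = exc.reason        # short description of why it failed

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = text.encode("utf-8")
```

If this is what happened, the fix is to choose an output encoding that can represent Cyrillic, such as UTF-8 or cp1251.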

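This is only a guess, since you did not specify what "not working" means.

If the file is being generated correctly but appears to contain garbage characters, the application you are viewing it with may not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file: the 3 bytes 0xEF,0xBB,0xBF (unencoded).

Source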
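In Python 3, one way to write that BOM is the standard "utf-8-sig" codec, which emits the three bytes at the start of the file. This is a minimal sketch rather than the answerer's code; the function name write_utf8_with_bom is hypothetical.

```python
import codecs


def write_utf8_with_bom(path, text):
    """Write text as UTF-8 preceded by the 3-byte BOM (EF BB BF).

    Hypothetical helper for illustration.
    """
    # The "utf-8-sig" codec writes the BOM before the first encoded character.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


def starts_with_bom(path):
    """Return True if the file begins with the UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

Reading the file back with encoding="utf-8-sig" strips the BOM again, whereas plain "utf-8" leaves it in the text as a leading U+FEFF character.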

2011-04-27 16:07:57

I get the error at f = open(filename, 'r').read() – Alex 2011-04-27 16:10:56
