2011-04-27 112 views
9

Python: How do I convert Windows-1251 to Unicode?

I want to use Python to convert the contents of a file from Windows-1251 (Cyrillic) to Unicode. I found this function, but it doesn't work.

#!/usr/bin/env python 

import os 
import sys 
import shutil 

def convert_to_utf8(filename):
    # gather the encodings you think that the file may be
    # encoded inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)

    # now start iterating in our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fail, then with the continue
            # keyword will start again with the second encoding
            # from the tuple an so on.... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue

    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)

    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()

How should I do this?

Thanks

+0

What do you mean by Unicode (http://en.wikipedia.org/wiki/Unicode)? – Gumbo 2011-04-27 16:00:19

+0

@Gumbo, judging by the code, the output is UTF-8. – 2011-04-27 16:03:25

Answers

10
import codecs 

f = codecs.open(filename, 'r', 'cp1251') 
u = f.read() # now the contents have been transformed to a Unicode string 
out = codecs.open(output, 'w', 'utf-8') 
out.write(u) # and now the contents have been output as UTF-8 

Is this what you intend to do?
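
The same idea can be sketched as a reusable function. The name `convert_cp1251_to_utf8` and its parameters are hypothetical, and `io.open` is swapped in for `codecs.open` here only because it behaves the same way on Python 2.6+ and Python 3. Passing `newline=''` is an assumption that the original line endings should be left untouched.

```python
import io


def convert_cp1251_to_utf8(src_path, dst_path):
    """Read src_path as Windows-1251 and write the same text to dst_path as UTF-8."""
    # Decoding happens on read: `text` is a Unicode string, not raw bytes.
    with io.open(src_path, 'r', encoding='cp1251', newline='') as src:
        text = src.read()
    # Encoding happens on write: the Unicode string is stored as UTF-8 bytes.
    with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)
    return text
```

Writing to a separate output path, rather than overwriting the input, keeps the original file available if the source encoding turns out to be something other than cp1251.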

+0

I think you're very close! I managed to read the data from the XML, but when I write it to a file I get strange characters instead of Cyrillic ones. – Alex 2011-04-27 16:26:59

+0

Yes! I got it! I was using cp1252. Thank you very much – Alex 2011-04-27 16:30:52

+0

@Alex, glad to hear your code is working. You might want to look at http://www.evanjones.ca/python-utf8.html, which has some good tips. – buruzaemon 2011-04-27 16:38:08

0

If you open the file with the codecs module, it converts the contents to Unicode for you as you read from the file. For example:

import codecs 
f = codecs.open('input.txt', encoding='cp1251') 
assert isinstance(f.read(), unicode) 

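This only makes sense if you are working with the file's data inside Python. If you are trying to convert a file on the file system from one encoding to another (which is what the script you posted attempts to do), you have to specify an actual encoding, because you cannot write a file in "Unicode".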
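
To make the distinction concrete, here is a small sketch using in-memory data. The sample word is only an illustration. Decoding turns Windows-1251 bytes into a Unicode string, and getting bytes that can be stored on disk means encoding that string with a concrete encoding such as UTF-8.

```python
# -*- coding: utf-8 -*-
# Sample data (an assumption for illustration): "Привет" as Windows-1251 bytes.
raw_cp1251 = b'\xcf\xf0\xe8\xe2\xe5\xf2'

# Decoding yields a Unicode string; it is text, not a byte encoding.
text = raw_cp1251.decode('cp1251')

# Storing the text requires choosing a real encoding, e.g. UTF-8.
raw_utf8 = text.encode('utf-8')

print(len(raw_cp1251))  # 6: one byte per Cyrillic letter in cp1251
print(len(raw_utf8))    # 12: two bytes per Cyrillic letter in UTF-8
```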

+0

I still get an error: UnicodeEncodeError: 'charmap' codec can't encode characters in position: character maps to – Alex 2011-04-27 16:04:39

+0

What is the actual code you are using? Which line triggers this exception? – 2011-04-27 16:08:44

+0

I get an error at f = open(filename,'r').read() – Alex 2011-04-27 16:11:04

0

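This is just a guess, since you didn't specify what "not working" means.

If the file is being generated correctly but appears to contain garbage characters, the application you are viewing it with may not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file: the 3 bytes 0xEF, 0xBB, 0xBF (unencoded).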
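
One way to add the BOM without writing the three bytes by hand is Python's `utf-8-sig` codec, which emits 0xEF, 0xBB, 0xBF at the start of the encoded output. A minimal sketch follows; the helper name `write_utf8_with_bom` is hypothetical, and whether a BOM actually helps depends on the viewing application.

```python
import io


def write_utf8_with_bom(path, text):
    """Write a Unicode string to path as UTF-8, preceded by a byte order mark."""
    # The 'utf-8-sig' codec prepends the BOM bytes EF BB BF when encoding.
    with io.open(path, 'w', encoding='utf-8-sig', newline='') as out:
        out.write(text)
```

Reading such a file back with `encoding='utf-8-sig'` strips the BOM again, whereas plain `'utf-8'` would leave it in the text as a leading U+FEFF character.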

+0

I get an error at f = open(filename,'r').read() – Alex 2011-04-27 16:10:56