How do I handle Chinese text in NLTK (Python)?

The input file is encoded as UTF-8 without a BOM, and each line looks like this:

(IP (NP (NP (NR 上海) (NR 浦东)) (NP (NN 开发) (NP (CC 与) (NP (NN 法制) (NN 建设))))) (VP (VV 同步)))

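I want to use

nltk.tree.Tree.fromstring

to build a tree from this string with NLTK, but my output comes out in the form "\u4e0a\u6d77".

How can I get the output as readable UTF-8 text?

I also don't understand why the output of a, by contrast, is printed as readable UTF-8 text.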

# -*- coding: utf-8 -*-
import nltk
tparse = nltk.tree.Tree.fromstring
import sys
reload(sys)
sys.setdefaultencoding('utf8')
class cal_prob:
    def __init__(self):
        pass
    def input_dataset(self, path="CTB-auto-pos/"):
        trainfile = open(path+"train.txt", "r+")
        datas = trainfile.read().split("\n")
        for data in datas:
            data = unicode(data) # change them to unicode
            print data
            tree = tparse(data)
            print tree
            print unicode(str(tree)).decode("utf8")
            print unicode(str(tree)).encode("utf8")
            break
        #
        a = u"(IP \n (NP (NP (NR \u4e0a\u6d77) (NR \u6d66\u4e1c)) (NP (NN \u5f00\u53d1) (NP (CC \u4e0e) (NP (NN \u6cd5\u5236) (NN \u5efa\u8bbe))))) (VP (VV \u540c\u6b65)))"
        print a
        print a.decode("utf8")
        trainfile.close()
a = cal_prob()
a.input_dataset()

Source

2016-12-17 Swind D.C. Xu

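You are using Python 2. Switching to Python 3 today will take care of this problem (or at least 90% of it), and of your next 50 problems. Sorting out the crazy things Python 2 does with character encodings is just not worth it now that the current version has gotten so much better. – alexis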

And " '\u6d66' ".decode('unicode-escape') is useful too.
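To make the unicode-escape suggestion concrete, here is a minimal sketch of turning escaped text back into readable characters. It is written for Python 3, where str has no decode method, so the text is first encoded to ASCII bytes. The helper names escape_non_ascii and unescape are made up for this illustration and are not part of NLTK. The first helper only shows how escapes of this shape can be produced in general; it is not a claim about what NLTK does internally.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers for illustration only; neither is an NLTK function.


def escape_non_ascii(text):
    """Replace every non-ASCII character with a backslash escape."""
    return text.encode('ascii', 'backslashreplace').decode('ascii')


def unescape(text):
    """Turn literal backslash-u escapes back into characters.

    The input must be pure ASCII; encode('ascii') raises otherwise.
    Other backslash sequences in the input are interpreted as well.
    """
    return text.encode('ascii').decode('unicode_escape')


original = u'(NP (NR \u4e0a\u6d77) (NR \u6d66\u4e1c))'
escaped = escape_non_ascii(original)  # ASCII only, escapes spelled out
restored = unescape(escaped)          # same text as original again

if __name__ == '__main__':
    print(escaped)
    print(restored == original)
```

The unicode_escape codec treats non-ASCII bytes as Latin-1, which is why the sketch insists on ASCII input instead of decoding arbitrary bytes with it.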

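Here is an example that opens the encoded file correctly. There is no need for the reload(sys) trick (see https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/) or for any other encoding/decoding.

tree.pformat() displays the tree the way you want: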

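import io

import nltk

# io.open decodes the file as UTF-8, so every line arrives as unicode text
with io.open('train.txt', encoding='utf8') as trainfile:
    for line in trainfile:
        tree = nltk.tree.Tree.fromstring(line)
        print tree            # first block of the output: \uXXXX escapes
        print
        print tree.pformat()  # second block of the output: readable characters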

Output:

(IP
  (NP
    (NP (NR \u4e0a\u6d77) (NR \u6d66\u4e1c))
    (NP (NN \u5f00\u53d1) (NP (CC \u4e0e) (NP (NN \u6cd5\u5236) (NN \u5efa\u8bbe)))))
  (VP (VV \u540c\u6b65)))

(IP
  (NP
    (NP (NR 上海) (NR 浦东))
    (NP (NN 开发) (NP (CC 与) (NP (NN 法制) (NN 建设)))))
  (VP (VV 同步)))
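The file-reading half of this answer can be tried without NLTK installed. The sketch below uses only the standard library and runs on Python 3 (io.open is also available on Python 2). The sample line is a shortened version of the one in the question, and both the temporary train.txt and the helper name read_lines are assumptions made so that the example is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile

# Shortened, hypothetical sample based on the line shown in the question.
SAMPLE = u'(IP (NP (NR \u4e0a\u6d77)) (VP (VV \u540c\u6b65)))\n'


def read_lines(path):
    """Read a UTF-8 file and return its non-empty lines as text."""
    with io.open(path, encoding='utf8') as handle:
        return [line.rstrip(u'\n') for line in handle if line.strip()]


# Create a throwaway train.txt so the sketch does not depend on real data.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'train.txt')
with io.open(path, 'w', encoding='utf8') as handle:
    handle.write(SAMPLE)

lines = read_lines(path)   # decoded text, no manual decode() calls needed
shutil.rmtree(tmp_dir)     # remove the temporary directory again
```

Because the decoding happens once, at the file boundary, each element of lines is already text and could be passed to nltk.tree.Tree.fromstring as in the answer above.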

Source

2016-12-17 19:34:29
