Using Unicode for Persian in Python

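I am writing a script that reads a corpus file and looks for suffixes. The corpus contains Persian words, so it is UTF-8 encoded. When I search for a Persian suffix I get no results, whereas searching for English suffixes works fine.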

from __future__ import unicode_literals 
import nltk 
import sys 


for line in open("corpus.txt"):
    for word in line.split():
        if word.endswith('ب'):
            print(word)


2015-05-07 adel rahimi

What do you mean by *I get no results*? – Kasramvd

And which Python version are you using? (It looks like Python 3, but I need to be sure!) – Kasramvd

I am using Python 3.4. I actually get no output at all in the shell, as if the corpus contained no matching words, @Kasra –

In Python 3, you can pass encoding="utf-8" to open:

with open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)
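To see why the explicit encoding likely matters here: in Python 3, open() without an encoding argument falls back to the platform default (locale.getpreferredencoding(False)), which is not UTF-8 on every system. The sketch below simulates that situation by decoding UTF-8 bytes with the wrong codec. The sample word and the choice of latin-1 as the "wrong" codec are assumptions for illustration only; the actual default on the asker's machine is not stated in the thread.

```python
# Hypothetical sample: a Persian word that ends with the suffix being searched.
word = "کتاب"
suffix = "ب"

# The bytes as they would be stored in a UTF-8 corpus file.
raw = word.encode("utf-8")

# Decoding with the wrong codec raises no error here; it silently yields
# different characters, so the suffix test can never match.
mis_decoded = raw.decode("latin-1")
print(mis_decoded.endswith(suffix))

# Decoding with the codec the file was written in restores the original word.
decoded = raw.decode("utf-8")
print(decoded.endswith(suffix))
```

English words are unaffected by this kind of mismatch because ASCII letters map to the same bytes in UTF-8 and in most legacy codecs, which would be consistent with the English searches working.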

In Python 2, you need to do something like this:

import codecs
with codecs.open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)
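Putting the Python 3 variant together with the suffix check from the question, a minimal end-to-end sketch could look like the following. The function name words_with_suffix, the temporary file, and the sample words are all hypothetical and exist only to make the example self-contained; in the real script you would pass the path to corpus.txt instead.

```python
import os
import tempfile


def words_with_suffix(path, suffix, encoding="utf-8"):
    """Return every whitespace-separated word in the file that ends with suffix."""
    matches = []
    with open(path, encoding=encoding) as fp:
        for line in fp:
            for word in line.split():
                if word.endswith(suffix):
                    matches.append(word)
    return matches


# Write a tiny throwaway corpus (made-up sample data) as UTF-8.
sample = "کتاب book آب water\nشب night\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    corpus_path = tmp.name

persian_matches = words_with_suffix(corpus_path, "ب")
english_matches = words_with_suffix(corpus_path, "k")
os.remove(corpus_path)

print(len(persian_matches), english_matches)
```

Collecting the matches in a list rather than printing inside the loop keeps the file handling separate from the output, so the same function can be reused for other suffixes.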


2015-05-07 15:08:09

I am actually using Python 3.4, but it works. Thanks. –
