How to count the number of unique words in only one part of a file

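I have a file in Persian. Each line holds a Persian sentence, a tab, a Persian word, another tab, and then an English word. I need to count the number of unique words in the Persian sentences only, not the Persian and English words that follow the tabs. Here is the code: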

from hazm import *

file = "F.txt"
def WordsProbs(file):
    words = set()
    with open(file, encoding="utf-8") as f1:
        normalizer = Normalizer()
        for line in f1:
            tmp = line.strip().split("\t")
            words.update(set(normalizer.normalize(tmp[0].split())))
    print(len(words), "unique words")
    print(words)

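To get at just the sentence, I have to split each line on "\t". To get at each word of the sentence, I have to split tmp[0]. The problem is that when I run the code, the error below occurs. It is caused by the split after tmp[0]. But if I omit that split, the code counts letters rather than unique words. How can I fix this? (Is there another way to write this code to count the unique words?)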

Error:

Traceback (most recent call last):
  File "C:\Users\yasini\Desktop\16.py", line 15, in <module>
    WordsProbs(file)
  File "C:\Users\
  File "C:\Python34\lib\site-packages\hazm\Normalizer.py", line 46, in normalize
    text = self.character_refinement(text)
  File "C:\Python34\lib\site-packages\hazm\Normalizer.py", line 65, in character_refinement
    text = text.translate(self.translations)
AttributeError: 'list' object has no attribute 'translate'

Sample file: https://www.dropbox.com/s/r88hglemg7aot0w/F.txt?dl=0

2016-10-29 Vahideh

I figured it out myself.

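from hazm import Normalizer

file = "F.txt"

def WordsProbs(file):
    """Count the unique words in the sentence column (the text before the first tab)."""
    unique_words = set()
    normalizer = Normalizer()
    with open(file, encoding="utf-8") as f1:
        for line in f1:
            # normalize() expects a string, so normalize the line before splitting it
            line = normalizer.normalize(line)
            tmp = line.strip().split("\t")
            # tmp[0] is the sentence; split() yields its words and the set drops duplicates
            unique_words.update(tmp[0].split())
    print(len(unique_words), "unique words")
    return unique_words

WordsProbs(file)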

2016-10-30 11:47:00 Vahideh

The problem is that hazm.Normalizer.normalize expects a space-separated string as its argument, not a list. You can see an example here, under the "Usage" heading.
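The last frame of the traceback fits this explanation: it calls text.translate(...), which is a method of str that lists do not have. Here is a minimal sketch of the same failure using only the standard library. The `translations` table and the `refine` helper are made-up stand-ins for illustration, not hazm's actual code.

```python
# Hypothetical stand-in for the normalizer's translation table:
# maps Arabic yeh and kaf to their Persian forms.
translations = str.maketrans({"ي": "ی", "ك": "ک"})

def refine(text):
    # translate() is a str method, so this only works when `text` is a str.
    return text.translate(translations)

sentence = "كتاب علي"
refined = refine(sentence)        # str in, str out

try:
    refine(sentence.split())      # list in: fails the same way as the traceback
    error = None
except AttributeError as exc:
    error = str(exc)

print(refined)
print(error)
```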

Remove the .split() from the argument to your normalize call, so that

words.update(set(normalizer.normalize(tmp[0].split())))

becomes

words.update(set(normalizer.normalize(tmp[0])))

and you should be good to go.

2016-10-29 14:21:03 bunji

I tried that before. If I omit the split, it counts letters, not words. – Vahideh
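The comment matches how set() behaves: applied to a string, it iterates over the characters, so the shortened call collects letters. One way to avoid both problems is to normalize the sentence while it is still a string and split it afterwards. The sketch below makes some assumptions: `count_unique_words` and its `normalize` parameter are names introduced here for illustration, and the sample lines are made up. The default `normalize=str` leaves the text unchanged so that the example runs without hazm; with hazm installed you would presumably pass `Normalizer().normalize` instead.

```python
def count_unique_words(lines, normalize=str):
    """Return the set of unique words found before the first tab of each line.

    `normalize` is any str -> str callable (assumption: hazm's
    Normalizer().normalize fits this shape, as the thread describes).
    """
    words = set()
    for line in lines:
        sentence = line.strip().split("\t")[0]      # keep only the sentence column
        words.update(normalize(sentence).split())   # normalize the str, then split
    return words

# Made-up sample in the question's layout: sentence <TAB> word <TAB> word
sample = [
    "من کتاب را خواندم\tکتاب\tbook\n",
    "او کتاب را خرید\tخرید\tbought\n",
]

unique = count_unique_words(sample)
letters = set(sample[0].split("\t")[0])  # set() on a str yields single characters

print(len(unique), "unique words")
```

An open file object can be passed as `lines`, since iterating over it yields one line at a time.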
