
I want to remove stopwords from a file (each line contains a sentence, a tab character, and then an English word). The stopwords are in a separate file, and the language is Persian. The code below works, but the problem is that it removes a stopword on one line yet leaves the same stopword untouched on other lines, and this happens with almost every stopword. I guessed it might be a normalization issue, so I normalized both files by importing the hazm module (hazm is like NLTK, but for Persian), but the problem remained. Can somebody help me delete these specific words from the file?

from hazm import* 
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛' 

file1 = "stopwords.txt" 
file2 = "test/پر.txt" 


witoutStops = [] 
corpuslines = [] 

def RemStopWords(file1, file2):
    with open(file1, encoding="utf-8") as stopfile:
        normalizer = Normalizer()
        stopwords = stopfile.read()
        stopwords = normalizer.normalize(stopwords)
        with open(file2, encoding="utf-8") as trainfile:
            with open("y.txt", "w", encoding="utf-8") as newfile:
                for line in trainfile:
                    tmp = line.strip().split("\t")
                    tmp[0] = normalizer.normalize(tmp[0])
                    corpuslines.append(tmp)
                    for row in corpuslines:
                        line = ""
                        tokens = row[0].split()
                        for token in tokens:
                            if token not in stopwords:
                                line += token + " "
                    line = line.strip() + "\n"
                    for i in punctuation:  # deletes punctuations
                        if i in line:
                            line = line.replace(i, "")
                    newfile.write(line)
                    witoutStops.append(line)
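
The post does not show how the function is invoked; presumably it is called with the two paths defined at the top, e.g.:

    RemStopWords(file1, file2)  # writes the filtered corpus to y.txt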

Stopwords file: https://www.dropbox.com/s/irjkjmwkzwnnpnk/stopwords.txt?dl=0

Input file: https://www.dropbox.com/s/p4m8san3xhr0pdj/%D9%BE%D8%B1.txt?dl=0


Possible duplicate of [Delete stop words using regular expression](http://stackoverflow.com/questions/41417528/delete-stop-words-using-regular-expression)

Answer


I found the problem. In some of the text, punctuation is attached to the words, and the code treats it as part of the word rather than as punctuation, so the token no longer matches the stopword list. If the punctuation is removed first, by moving the three lines of punctuation-removal code up to just after the line "tmp[0] = normalizer.normalize(tmp[0])", and the stopwords are removed afterwards, then all the stopwords are removed.
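
A minimal sketch of that reordering (the helper name rem_stop_words is just for this sketch; it keeps the same normalizer, punctuation string, tab-separated layout, and y.txt output as the question, but reads the stopwords into a set so the membership test is an exact token match rather than a substring check against one long string):

from hazm import Normalizer

punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛'

def rem_stop_words(stop_path, corpus_path, out_path="y.txt"):
    normalizer = Normalizer()
    with open(stop_path, encoding="utf-8") as stopfile:
        # normalized stopwords as a set, for exact per-token lookups
        stopwords = set(normalizer.normalize(stopfile.read()).split())
    with open(corpus_path, encoding="utf-8") as trainfile, \
         open(out_path, "w", encoding="utf-8") as newfile:
        for line in trainfile:
            # keep only the sentence part (before the tab) and normalize it
            sentence = normalizer.normalize(line.strip().split("\t")[0])
            # strip punctuation FIRST, so a stopword with an attached
            # comma or question mark still matches the stopword list
            for ch in punctuation:
                sentence = sentence.replace(ch, "")
            kept = [tok for tok in sentence.split() if tok not in stopwords]
            newfile.write(" ".join(kept) + "\n")

The essential change is only the ordering: punctuation is stripped from the sentence before the stopword filtering, which is the rearrangement described above.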
