Removing stop words with NLTK

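I want to read a text file (foo1.txt), remove all the stop words defined by NLTK, and write the result to another file (foo2.txt). The code is below. Required import: from nltk.corpus import stopwords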

def stop_words_removal(): 
    with open("foo1.txt") as f: 
      reading_file_line = f.readlines() #entire content, return list 
      #print reading_file_line #list 
      reading_file_info = [item.rstrip('\n') for item in reading_file_line] 
      #print reading_file_info #List and strip \n 
      #print ' '.join(reading_file_info) 
      '''-----------------------------------------''' 
      #Filtering & converting to lower letter 
      for i in reading_file_info: 
       words_filtered = [e.lower() for e in i.split() if len(e) >= 4]     
       print words_filtered 

      '''-----------------------------------------''' 
      '''removing the stop words from the file''' 
      word_list = words_filtered[:] 
      #print word_list 
      for word in words_filtered: 
         if word in nltk.corpus.stopwords.words('english'): 
          print word 
          print word_list.remove(word) 

      '''-----------------------------------------''' 
      '''write the output in a file''' 
      z = ' '.join(words_filtered) 
      out_file = open("foo2.txt", "w") 
      out_file.write(z) 
      out_file.close()

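The problem is that the second part of the code, "removing the stop words from the file", does not work. Any suggestions would be appreciated. Thanks.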

Example Input File: 
'I a Love this car there', 'positive', 
'This a view is amazing there', 'positive', 
'He is my best friend there', 'negative' 

Example Output: 
['love', "car',", "'positive',"] 
['view', "amazing',", "'positive',"] 
['best', "friend',", "'negative'"]

I tried what was suggested in this link, but it did not work.

2013-05-17 J4cK

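Are you sure this is the output you want? Do you need the punctuation? – elyase

@elyase Thanks for your reply. Actually I don't need the square brackets, but I do need each line to be clearly separated. The code you posted below only works for the last line of the file. I want to remove the stop words from every line of the text. – J4cK

OK, I edited my answer – elyase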

Here is what I would do inside your function:

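import string
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
with open('input.txt', 'r') as inFile, open('output.txt', 'w') as outFile:
    for line in inFile:
        # Python 2 form; on Python 3 use translate(str.maketrans('', '', string.punctuation))
        words = line.lower().translate(None, string.punctuation).split()
        print(' '.join(w for w in words if len(w) >= 4 and w not in stop_words), file=outFile)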

Don't forget to add:

from __future__ import print_function

if you are on Python 2.x.
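
For reference, the code in the question fails for three reasons that can be read off the code itself: `words_filtered` is reassigned on every pass of the first loop, so only the last line of the file survives; the stop words are removed from the copy `word_list`, but the file is written from the untouched `words_filtered`; and with only `from nltk.corpus import stopwords` imported, the name `nltk` in `nltk.corpus.stopwords` is undefined. Below is a minimal Python 3 sketch of the per-line approach. The function name `remove_stop_words`, the parameter `min_len`, and the tiny `DEMO_STOP_WORDS` set are assumptions made for illustration (the set stands in for the real NLTK list so the snippet runs without the corpus installed).

```python
import string

# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')),
# kept tiny so the sketch runs without downloading the NLTK corpus.
DEMO_STOP_WORDS = {"this", "there", "that", "with", "have", "from"}

# Translation table that deletes every ASCII punctuation character (Python 3).
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def remove_stop_words(lines, stop_words, min_len=4):
    """Return one cleaned string per input line."""
    cleaned = []
    for line in lines:
        # Lowercase, drop punctuation, then split on whitespace.
        words = line.lower().translate(_STRIP_PUNCT).split()
        # Keep words that are long enough and not in the stop word set.
        kept = [w for w in words if len(w) >= min_len and w not in stop_words]
        cleaned.append(" ".join(kept))
    return cleaned


if __name__ == "__main__":
    sample = [
        "'I a Love this car there', 'positive',",
        "'This a view is amazing there', 'positive',",
        "'He is my best friend there', 'negative'",
    ]
    for row in remove_stop_words(sample, DEMO_STOP_WORDS):
        print(row)
```

With the real list you would pass `set(stopwords.words('english'))` as `stop_words` (after `nltk.download('stopwords')` has fetched the corpus) and write each returned string to foo2.txt followed by a newline, which keeps every input line separate in the output.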

2013-05-17 16:25:52 elyase
