How do I add more stopwords to the NLTK list?

I have the following code. I need to add more words to the NLTK stopword list. After I run this, the words are not added to the list.

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer 
import string 
stop = set(stopwords.words('english'))  
new_words = open("stopwords_en.txt", "r") 
new_stopwords = stop.union(new_word) 
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer() 
def clean(doc): 
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])  
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude) 
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split()) 
    return normalized 
doc_clean = [clean(doc).split() for doc in emails_body_text]

Source

2017-09-21 Vrushab Jain

Please fix the indentation of your code - it makes no sense the way you have it. – alexis

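Surely 'new_stopwords = stop.union(new_word)' should read 'new_stopwords = stop.union(new_words)'? Also, 'new_words = open("stopwords_en.txt", "r")' returns a file object, so you are adding the file object to the stopword list rather than its contents. Surely you want something like 'new_words = open("stopwords_en.txt", "r").readlines()'? –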
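To see what actually ends up in the set, here is a minimal sketch. It assumes a stopword file with one word per line and uses a tiny stand-in for `set(stopwords.words('english'))` so that it runs without the NLTK data; the file contents are made up for illustration. In Python, a set union over an open file iterates it line by line, so each entry keeps its trailing newline and never matches a token.

```python
import os
import tempfile

# Stand-in for set(stopwords.words('english')); an assumption so the sketch runs without NLTK data.
stop = {"the", "a", "is"}

# Hypothetical stopword file with one word per line.
path = os.path.join(tempfile.mkdtemp(), "stopwords_en.txt")
with open(path, "w") as f:
    f.write("email\nregards\n")

# Passing the file object directly: iteration yields whole lines, newline included.
with open(path, "r") as f:
    raw = stop.union(f)

# Stripping each line (and skipping blank ones) gives usable entries.
with open(path, "r") as f:
    stripped = stop.union(line.strip() for line in f if line.strip())

print("email" in raw)       # the stored entry is "email\n", so this lookup misses
print("email" in stripped)  # the stripped entry matches
```

The same applies to `readlines()`: it returns the lines with their newlines still attached, so they need stripping before the union.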

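Don't do things blindly. Read in the new stopword list, inspect it to check that it is correct, and only then add it to the other stopword list. Start with the code suggested by @greg_data, but you will need to strip the newlines, and maybe other things - who knows what your stopwords file looks like?

This might do it, for example: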

new_words = open("stopwords_en.txt", "r").read().split() 
new_stopwords = stop.union(new_words)
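The "read, inspect, then merge" advice can be sketched a little more fully. This is only an illustration under assumptions: `load_extra_stopwords` is a hypothetical helper name, the file is assumed to hold whitespace-separated words, lowercasing is added because the question's code lowercases documents before filtering, and `stop` is a small stand-in for `set(stopwords.words('english'))`.

```python
import os
import tempfile

def load_extra_stopwords(path):
    """Read whitespace-separated words from a file and return them as a lowercase set."""
    # The context manager closes the file; split() discards newlines and blank lines.
    with open(path, "r", encoding="utf-8") as f:
        return {word.lower() for word in f.read().split()}

# Stand-in for set(stopwords.words('english')); an assumption so the sketch runs without NLTK data.
stop = {"the", "a", "is"}

# Hypothetical file contents, including stray spaces and an empty line.
path = os.path.join(tempfile.mkdtemp(), "stopwords_en.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Email\nregards  \n\nthanks\n")

extra = load_extra_stopwords(path)
print(sorted(extra))           # inspect the new words before merging them
new_stopwords = stop | extra   # same result as stop.union(extra)
```

If the file uses another layout (comma-separated, comments, multi-word entries), the splitting step would need to change accordingly.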

PS. Don't keep splitting and joining your documents; tokenize once and work with the list of tokens.
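One way to follow that advice, as a rough sketch rather than a definitive rewrite of `clean`: split the document once and pass the token list through each step. The name `clean_tokens` and the `lemmatize` parameter are hypothetical; the default is an identity function so the sketch runs without WordNet data.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tokens(doc, stopwords, lemmatize=lambda word: word):
    """Lowercase, tokenize once, then filter and normalize the token list."""
    tokens = doc.lower().split()                          # tokenize once
    tokens = [t for t in tokens if t not in stopwords]    # drop stopwords
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]   # strip punctuation characters
    tokens = [t for t in tokens if t]                     # drop tokens that were only punctuation
    return [lemmatize(t) for t in tokens]

# Stand-in stopword set; in the question this would be new_stopwords.
new_stopwords = {"the", "a", "is", "email"}
print(clean_tokens("The email is here, thanks!", new_stopwords))
```

With NLTK installed and the WordNet data downloaded, `lemmatize=WordNetLemmatizer().lemmatize` could be passed in, and `doc_clean = [clean_tokens(doc, new_stopwords) for doc in emails_body_text]` would then produce the token lists directly, with no final `.split()`.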

Source

2017-09-21 13:29:38 alexis
