
I am using stop words and the sentence tokenizer, but when I print the filtered sentence, the result still includes the stop words. The problem is that the stop words are not ignored in the output. How can I remove stop words when using the sentence tokenizer?

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

userinput1 = input("Enter file name:")
myfile1 = open(userinput1).read()
stop_words = set(stopwords.words("english"))
word1 = nltk.sent_tokenize(myfile1)
filtration_sentence = []
for w in word1:
    word = sent_tokenize(myfile1)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)

userinput2 = input("Enter file name:")
myfile2 = open(userinput2).read()
stop_words = set(stopwords.words("english"))
word2 = nltk.sent_tokenize(myfile2)
filtration_sentence = []
for w in word2:
    word = sent_tokenize(myfile2)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

'''remove punctuation, lowercase, stem'''
def normalize(text):
    return stem_tokens(nltk.sent_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(myfile1, myfile2):
    tfidf = vectorizer.fit_transform([myfile1, myfile2])
    return ((tfidf * tfidf.T).A)[0, 1]

print(cosine_sim(myfile1, myfile2))

Answer

I don't think you can remove stop words from sentences directly. You first have to split each sentence into words, for example with nltk.word_tokenize. Then, for each word, you check whether it is in the stop word list. Here is an example:

import nltk 
from nltk.corpus import stopwords 
stopwords_en = set(stopwords.words('english')) 

sents = nltk.sent_tokenize("This is an example sentence. We will remove stop words from this") 

sents_rm_stopwords = [] 
for sent in sents: 
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w.lower() not in stopwords_en)) 

Output

['example sentence .', 'remove stop words'] 

You can also strip punctuation by adding string.punctuation to the stop word set.

import string 
stopwords_punctuation = stopwords_en.union(string.punctuation) # merge the two sets
How do I use string.punctuation? @titipata – Muhammad

'import string' and 'string.punctuation', then you can do 'stopwords_en.union(string.punctuation)'. – titipata

OK, I am working on implementing that. One more question: my code above gives the cosine similarity between the two files, but I want it to show the similar sentences between the two files. How can I print them? @titipata – Muhammad