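I am using stop words together with a sentence tokenizer, but when I print the filtered sentences, the result still contains the stop words. The stop words are not being dropped from the output. How can I remove stop words when working with the sentence tokenizer?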
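import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

userinput1 = input("Enter file name:")
myfile1 = open(userinput1).read()
stop_words = set(stopwords.words("english"))
word1 = nltk.sent_tokenize(myfile1)
filtration_sentence = []
for w in word1:
    # Each element of `word` is a whole sentence, so it never equals a single stop word
    word = sent_tokenize(myfile1)
    filtered_sentence = [w for w in word if w not in stop_words]
print(filtered_sentence)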
userinput2 = input("Enter file name:")
myfile2 = open(userinput2).read()
stop_words = set(stopwords.words("english"))
word2 = nltk.sent_tokenize(myfile2)
filtration_sentence = []
for w in word2:
    # Same comparison as for the first file: whole sentences against single stop words
    word = sent_tokenize(myfile2)
    filtered_sentence = [w for w in word if w not in stop_words]
print(filtered_sentence)
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    # word_tokenize rather than sent_tokenize: the vectorizer needs word-level tokens
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(myfile1, myfile2):
    tfidf = vectorizer.fit_transform([myfile1, myfile2])
    # TfidfVectorizer L2-normalizes rows by default, so this product is the cosine similarity
    return (tfidf * tfidf.T).toarray()[0, 1]

print(cosine_sim(myfile1, myfile2))
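
One likely cause of the stop words surviving is that the filtering loops compare whole sentences against the stop-word set, and a full sentence is never equal to a single stop word. Below is a minimal sketch of filtering at word level inside each sentence. The function names `simple_word_tokenize`, `filter_stop_words`, and `filter_file_text` are made up for this illustration, and the regex tokenizer is only a stand-in so the core function runs without NLTK data; `filter_file_text` assumes NLTK is installed with its tokenizer and stopwords data downloaded.

```python
import re


def simple_word_tokenize(sentence):
    """Fallback tokenizer for this sketch: runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


def filter_stop_words(sentences, stop_words, word_tokenize=simple_word_tokenize):
    """Return one list of non-stop-word tokens per sentence."""
    return [
        [w for w in word_tokenize(sentence) if w.lower() not in stop_words]
        for sentence in sentences
    ]


def filter_file_text(text):
    """The same filter wired to NLTK; requires NLTK with its tokenizer and stopwords data."""
    import nltk
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    return filter_stop_words(nltk.sent_tokenize(text), stop_words, nltk.word_tokenize)
```

With this shape, `filter_file_text(myfile1)` would return a list of token lists, one per sentence. The tokens are lowercased only for the comparison, so the original casing is kept in the output.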
How do I use string.punctuation? @titipata – Muhammad
Use 'import string' and 'string.punctuation'; then you can do 'stopwords_en.union(string.punctuation)'. – titipata
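
A small sketch of that suggestion follows. The tiny `stopwords_en` set here is only a stand-in for `set(stopwords.words("english"))` so the snippet runs without NLTK data, and `stopwords_en_withpunct` and `tokens` are names chosen for the example.

```python
import string

# Stand-in for set(stopwords.words("english")) so the sketch runs without NLTK data
stopwords_en = {"the", "is", "a", "of"}

# set.union accepts any iterable; iterating over a string yields its characters,
# so every single punctuation character becomes a member of the new set
stopwords_en_withpunct = stopwords_en.union(string.punctuation)

tokens = ["the", "cat", ",", "is", "here", "!"]
filtered = [t for t in tokens if t.lower() not in stopwords_en_withpunct]
print(filtered)  # ['cat', 'here']
```

Only single punctuation characters end up in the set, so a token made of several punctuation characters (for example '...') would still pass through and needs separate handling.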
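OK, I am working on implementing that. One more question: my code above gives the cosine similarity between the two files, but I would like it to show the similar sentences between the two files. How can I print them? @titipata – Muhammad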
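
One possible way to approach this is to score every sentence of the first file against every sentence of the second and keep the pairs above a cutoff. The sketch below uses plain word-count vectors rather than TF-IDF so that it runs with the standard library only. The helper names, the regex tokenizer, and the 0.5 threshold are all assumptions made for illustration, not something from the thread.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts for one sentence (the regex tokenizer is an assumption)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_sentences(sentences1, sentences2, threshold=0.5):
    """Return (score, s1, s2) for cross-file pairs scoring at least `threshold`."""
    vectors2 = [(s2, bag_of_words(s2)) for s2 in sentences2]
    pairs = []
    for s1 in sentences1:
        v1 = bag_of_words(s1)
        for s2, v2 in vectors2:
            score = cosine(v1, v2)
            if score >= threshold:
                pairs.append((score, s1, s2))
    # Highest-scoring pairs first
    return sorted(pairs, key=lambda p: p[0], reverse=True)
```

Assuming the sentence lists come from `nltk.sent_tokenize(myfile1)` and `nltk.sent_tokenize(myfile2)`, the pairs could be printed with a loop such as `for score, s1, s2 in similar_sentences(...): print(round(score, 2), s1, "|", s2)`. Stop-word removal or stemming could be added inside `bag_of_words` if common words inflate the scores.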