獲取gensim和gensim.models.Word2Vec
模型中使用similar_by_word方法。
similar_by_word
需要3個參數,
- 輸入字
- N - 對於前n個類似的單詞(可選,默認值= 10)
- restrict_vocab(可選,默認=無)
示例
import gensim, nltk
class FileToSent(object):
"""A class to load a text file efficiently """
def __init__(self, filename):
self.filename = filename
# To remove stop words (optional)
self.stop = set(nltk.corpus.stopwords.words('english'))
def __iter__(self):
for line in open(self.filename, 'r'):
ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
yield ll
然後根據您輸入的句子(sentence_file.txt),
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, min_count=2, hs=1)
print model.similar_by_word('hack', 2) # Get two most similar words to 'hack'
# [(u'debug', 0.967338502407074), (u'patch', 0.952264130115509)] (Output specific to my dataset)