2014-06-12
2

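I am using gensim to create a dictionary from a collection of documents. Each document is a list of tokens. This is the code I use to add the tokens to the gensim dictionary: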

from gensim import corpora, models, similarities

def constructModel(self, docTokens):
    """Given document tokens, construct the tf-idf and similarity models."""

    # Build the dictionary for the BOW (vector-space) model.
    # Dictionary = a mapping between words and their integer ids
    #            = collection of (word_index, word_string) pairs
    self.dictionary = corpora.Dictionary(docTokens)

    # Prune dictionary: remove words that appear too infrequently or too frequently
    # (the pruning calls are currently commented out, so both prints show the same size).
    print("dictionary size before filter_extremes:", self.dictionary)
    #self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    #self.dictionary.compactify()

    print("dictionary size after filter_extremes:", self.dictionary)

    # Construct the corpus bow vectors; bow vector = collection of (word_id, word_frequency) pairs.
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]

    # Construct the tf-idf model.
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    # First transform each raw bow vector in the corpus to the tf-idf model's vector space.
    corpus_tfidf = self.model[corpus_bow]
    # Construct the term-document index.
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)

My question is how to add a new document (its tokens) to this dictionary and update it. I searched the gensim documentation but did not find a solution.

Answers

6

The gensim documentation describes how to do this.

The way to do it is to create another dictionary from the new documents and then merge the two:

from gensim import corpora 

dict1 = corpora.Dictionary(firstDocs) 
dict2 = corpora.Dictionary(moreDocs) 
dict1.merge_with(dict2) 

According to the documentation, this maps "same tokens to the same ids and new tokens to new ids".
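To make that mapping rule concrete, here is a small library-free sketch. It is only an illustration of the quoted behavior, not gensim's implementation; the function name `merge_token_ids` and the toy vocabularies are made up for this example.

```python
def merge_token_ids(base_token2id, other_token2id):
    """Merge `other_token2id` into `base_token2id` in place.

    Tokens already in the base vocabulary keep their id; unseen tokens
    get the next free id. Returns a dict that maps each id used by
    `other` to the corresponding id in the merged vocabulary.
    """
    old_to_new = {}
    next_id = max(base_token2id.values(), default=-1) + 1
    # Visit the other vocabulary in id order so new ids are assigned deterministically.
    for token, other_id in sorted(other_token2id.items(), key=lambda kv: kv[1]):
        if token not in base_token2id:
            base_token2id[token] = next_id
            next_id += 1
        old_to_new[other_id] = base_token2id[token]
    return old_to_new


# Hypothetical vocabularies standing in for dict1 and dict2.
first = {"human": 0, "computer": 1}
more = {"computer": 0, "graph": 1}

old_to_new = merge_token_ids(first, more)
print(first)       # {'human': 0, 'computer': 1, 'graph': 2}
print(old_to_new)  # {0: 1, 1: 2}
```

The returned id mapping matters because bag-of-words vectors built with the second dictionary still use its old ids; they have to be translated before they can be mixed with vectors from the merged dictionary.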

0

You can use the add_documents method:

from gensim import corpora 
text = [["aaa", "aaa"]] 
dictionary = corpora.Dictionary(text) 
dictionary.add_documents([['bbb','bbb']]) 
print(dictionary) 

After running the code above, you will get this:

Dictionary(2 unique tokens: ['aaa', 'bbb']) 

Read the documentation for more details.