2017-10-11 60 views
2

Does gensim.corpora.Dictionary store term frequencies?

With gensim.corpora.Dictionary, it is possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

from nltk.corpus import brown 
from gensim.corpora import Dictionary 

documents = brown.sents() 
brown_dict = Dictionary(documents) 

# The 100th word in the dictionary: 'these' 
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents') 

[out]:

The word "these" appears in 1213 documents 

There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens. Its documentation reads:

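filter_n_most_frequent(remove_n) Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!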

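Does filter_n_most_frequent remove the n most frequent tokens based on document frequency or on term frequency?

If it is the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?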

Answers

2

No, gensim.corpora.Dictionary does not store term frequencies, as the source code shows. The class only stores the following member variables:

self.token2id = {}  # token -> tokenId
self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

self.num_docs = 0  # number of documents processed
self.num_pos = 0  # total number of corpus positions
self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

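This means that everywhere in the class, "frequency" means document frequency and never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.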
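To make the distinction concrete, here is a minimal pure-Python sketch. It does not use gensim; the toy corpus and the names `doc_freq` and `term_freq` are made up for illustration. Document frequency counts each token once per document, while term frequency counts every occurrence:

```python
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
documents = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "and", "a", "dog"],
]

# Document frequency: count each token at most once per document.
doc_freq = Counter(token for doc in documents for token in set(doc))

# Term frequency: count every occurrence across the whole corpus.
term_freq = Counter(token for doc in documents for token in doc)

print(doc_freq["the"], term_freq["the"])  # 2 3
print(doc_freq["a"], term_freq["a"])      # 1 2
```

The two counts can rank tokens differently, which is why it matters which one a filtering method is based on.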

0

Could you do something like this?

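import pandas as pd
from gensim import corpora

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(sent) for sent in documents]
# Sum per-document counts for each token id (builds a dense frame, so this is memory-hungry on large corpora)
tf_by_id = pd.DataFrame([dict(bow) for bow in corpus]).sum(axis=0).sort_index()
vocab = [dictionary[int(i)] for i in tf_by_id.index]  # list of terms, in the same order as vocab_tf
vocab_tf = list(tf_by_id)  # list of term frequencies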
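A lighter-weight alternative that avoids building a DataFrame is to sum the `(token_id, count)` pairs directly with `collections.Counter`. The sketch below assumes a corpus in the shape that `doc2bow` returns (one list of `(token_id, count)` pairs per document). The helper name `term_frequencies` and the hand-written sample data are hypothetical, chosen so the snippet runs without gensim installed:

```python
from collections import Counter


def term_frequencies(bow_corpus, id2token):
    """Sum per-document counts into corpus-wide term frequencies."""
    totals = Counter()
    for bow in bow_corpus:
        for token_id, count in bow:
            totals[token_id] += count
    # Map token ids back to the tokens themselves.
    return {id2token[token_id]: total for token_id, total in totals.items()}


# Hand-written stand-in for a doc2bow corpus (hypothetical data).
id2token = {0: "cat", 1: "the", 2: "dog"}
bow_corpus = [
    [(0, 1), (1, 2)],  # "cat" once, "the" twice
    [(1, 1), (2, 1)],  # "the" once, "dog" once
    [(0, 1), (2, 1)],  # "cat" once, "dog" once
]

tf = term_frequencies(bow_corpus, id2token)
print(tf)
```

With gensim, `bow_corpus` would be the `corpus` list built with `doc2bow`, and the `dictionary` object itself can be passed as `id2token`, since `dictionary[token_id]` returns the token. Later gensim releases also appear to expose a `cfs` attribute (collection frequencies) on Dictionary, so it is worth checking the documentation for your installed version before computing the counts by hand.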