Does gensim.corpora.Dictionary have term frequency saved?


From gensim.corpora.Dictionary, it is possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

from nltk.corpus import brown 
from gensim.corpora import Dictionary 

documents = brown.sents() 
brown_dict = Dictionary(documents) 

# The 100th word in the dictionary: 'these' 
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')

[out]:

The word "these" appears in 1213 documents

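There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n) Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!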
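The note about changing ids can be illustrated with a small sketch in plain Python. This is not gensim's implementation: the helper name drop_n_most_frequent and the toy data are assumptions made for illustration, and the frequency table is passed in so that the sketch does not presuppose which kind of frequency is used.

```python
from collections import Counter

def drop_n_most_frequent(token2id, counts, remove_n):
    """Hypothetical helper: drop the remove_n highest-count tokens and
    renumber the remaining tokens so that ids are contiguous again."""
    most_frequent = {token for token, _ in counts.most_common(remove_n)}
    # Keep the surviving tokens in their original id order.
    kept = [t for t in sorted(token2id, key=token2id.get) if t not in most_frequent]
    # Closing the gaps means every token after a removed one gets a smaller id.
    return {token: new_id for new_id, token in enumerate(kept)}

# Toy data (assumed): ids before filtering and some per-token counts.
token2id = {"a": 0, "b": 1, "c": 2, "d": 3}
counts = Counter({"a": 3, "b": 1, "c": 2, "d": 1})

new_token2id = drop_n_most_frequent(token2id, counts, remove_n=1)
print(new_token2id)  # {'b': 0, 'c': 1, 'd': 2}
```

Here "b" had id 1 before the call and id 0 afterwards, which is the effect the docstring warns about: any ids saved before filtering should not be reused afterwards.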

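Does the filter_n_most_frequent function remove the n most frequent tokens based on document frequency or on term frequency?

If it is the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?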

Source

2017-10-11 alvas

No, gensim.corpora.Dictionary does not save term frequency. You can see the source code here. The class only stores the following member variables:

self.token2id = {} # token -> tokenId
self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared

self.num_docs = 0 # number of documents processed
self.num_pos = 0 # total number of corpus positions
self.num_nnz = 0 # total number of non-zeroes in the BOW matrix

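This means that wherever the class refers to frequency, it means document frequency, never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.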
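To make the distinction concrete, here is a minimal sketch in plain Python that computes both quantities over a toy corpus. The function names and the sample documents are assumptions made for illustration; nothing here calls gensim.

```python
from collections import Counter

def document_frequency(documents):
    """Token -> number of documents that contain the token at least once."""
    dfs = Counter()
    for doc in documents:
        dfs.update(set(doc))  # a document counts once per distinct token
    return dfs

def term_frequency(documents):
    """Token -> total number of occurrences across all documents."""
    tfs = Counter()
    for doc in documents:
        tfs.update(doc)  # every occurrence counts
    return tfs

# Toy corpus (assumed): "the" is repeated within a single document.
docs = [["the", "the", "the", "the", "cat"], ["cat", "sat"], ["cat", "ran"]]

dfs = document_frequency(docs)
tfs = term_frequency(docs)
print(dfs.most_common(1))  # [('cat', 3)]
print(tfs.most_common(1))  # [('the', 4)]
```

The two rankings disagree: a filter based on document frequency would drop "cat" first, while one based on term frequency would drop "the". Later gensim releases also expose a cfs (collection frequencies) attribute on Dictionary, so it is worth checking the documentation for the installed version before computing term frequencies by hand.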

Source

2017-10-17 05:51:36 ubadub

Could you do something like this?

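import pandas as pd
from gensim import corpora
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab_tf = [dict(i) for i in corpus] # one {token id: count} dict per document
tf_by_id = pd.DataFrame(vocab_tf).sum(axis=0) # summed counts, indexed by token id
vocab = [dictionary[i] for i in tf_by_id.index] # list of terms, in the same order as tf_by_id
vocab_tf = list(tf_by_id) # list of term frequencies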

Source

2017-12-28 17:01:34
