2016-02-22 98 views
8

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix. I can get the document-term matrix, but I am not sure how to go about obtaining a word-word matrix of co-occurrences.

Could you add some data and your attempt at solving the problem? – Cleb

Answers

1

You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer.

Code example:

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means you only want sequences of exactly 2 adjacent words

If you want to state explicitly which word co-occurrences to count, use the vocabulary parameter, i.e.: vocabulary = {'awesome unicorns':0, 'batman forever':1}

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Here is self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we only track co-occurrences of awesome unicorns and batman forever:

from sklearn.feature_extraction.text import CountVectorizer 
import numpy as np 
samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever'] 
# A fixed vocabulary restricts counting to these two bigrams (value = column index) 
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1}) 
co_occurrences = bigram_vectorizer.fit_transform(samples) 
print('Printing sparse matrix:', co_occurrences) 
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense()) 
sum_occ = np.sum(co_occurrences.todense(), axis=0) 
print('Sum of word-word occurrences:', sum_occ) 
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2 
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), np.array(sum_occ)[0].tolist()))) 

The final output is ('awesome unicorns', 1), ('batman forever', 2), which matches exactly the data we provided in samples.

11

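Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to obtain the word-word co-occurrence matrix.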

from sklearn.feature_extraction.text import CountVectorizer 
docs = ['this this this book', 
     'this cat good', 
     'cat good shit'] 
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model 
X = count_model.fit_transform(docs) 
Xc = (X.T @ X) # this is the co-occurrence matrix in sparse format 
Xc.setdiag(0) # sometimes you want to set the co-occurrence of a word with itself to 0 
print(Xc.todense()) # print out matrix in dense format 
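One thing to keep in mind is that the matrix product of X.T and X multiplies raw counts, so a word repeated inside one document inflates the result (here 'this' and 'book' get a value of 3 from the first document alone). If you would rather count the number of documents in which two words appear together, one option is the sketch below. It assumes the binary=True option of CountVectorizer suits your use case; the names binary_model, Xb and Xb_co are just illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']

# binary=True records presence/absence instead of raw term counts.
binary_model = CountVectorizer(binary=True)
Xb = binary_model.fit_transform(docs)

# Entry (i, j) is now the number of documents containing both word i and word j.
Xb_co = (Xb.T @ Xb).toarray()

vocab = binary_model.vocabulary_
print(Xb_co[vocab['this'], vocab['book']])  # documents containing both 'this' and 'book'
print(Xb_co[vocab['cat'], vocab['good']])   # documents containing both 'cat' and 'good'
```

With this variant the diagonal holds each word's document frequency.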

You can also look up the dictionary of words (word to column index) in count_model:

count_model.vocabulary_ 
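Since the vocabulary maps each word to its row/column index in the co-occurrence matrix, a specific pair can be looked up directly. The sketch below repeats the setup so that it runs on its own; the helper name cooccurrence is hypothetical and not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)
Xc = (X.T @ X).toarray()  # dense copy, fine for a toy corpus


def cooccurrence(word_a, word_b):
    """Hypothetical helper: co-occurrence value of two vocabulary words."""
    vocab = count_model.vocabulary_
    return int(Xc[vocab[word_a], vocab[word_b]])


print(cooccurrence('cat', 'good'))
print(cooccurrence('this', 'book'))
```

For a large vocabulary you would probably keep Xc sparse and index it directly instead of calling toarray().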

Or, if you want to normalize by the diagonal components (as described in the answer to the previous post):

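import scipy.sparse as sp 
Xc = (X.T @ X) # recompute so the diagonal still holds the self co-occurrence counts 
g = sp.diags(1./Xc.diagonal()) # diagonal matrix of reciprocal diagonal entries 
Xc_norm = g @ Xc # normalized co-occurrence matrix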
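To make the normalization concrete: multiplying by the diagonal matrix g on the left divides row i of Xc by Xc[i, i]. The following is a minimal self-contained check on the same toy corpus. It assumes every vocabulary word occurs at least once, so no diagonal entry is zero; the variable expected is only there for the comparison.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']
X = CountVectorizer().fit_transform(docs)

Xc = X.T @ X
g = sp.diags(1.0 / Xc.diagonal())
Xc_norm = (g @ Xc).toarray()

# Row i of the result should equal row i of Xc divided by Xc[i, i].
dense = Xc.toarray()
expected = dense / dense.diagonal()[:, None]
print(np.allclose(Xc_norm, expected))
```

After this scaling every diagonal entry is 1, and the matrix is in general no longer symmetric.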