word-word co-occurrence matrix

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix. I can get the document-term matrix, but I am not sure how to go about obtaining a word-word matrix of co-occurrences.

2016-02-22 newdev14

Can you add some data and show your attempts to solve the problem? – Cleb

You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer.

Code example:

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))  # (2, 2) means only two-word sequences (bigrams) are counted

If you want to state explicitly which word co-occurrences to count, use the vocabulary parameter, i.e.: vocabulary = {'awesome unicorns':0, 'batman forever':1}

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we track only the co-occurrences of awesome unicorns and batman forever:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
# ngram_range must cover bigrams so the two-word vocabulary entries can be matched
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
sum_occ = np.sum(co_occurrences.todense(), axis=0)
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), np.array(sum_occ)[0].tolist())))

The final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to the data in our samples.


2016-02-22 22:20:48

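Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.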

from sklearn.feature_extraction.text import CountVectorizer 
docs = ['this this this book', 
     'this cat good', 
     'cat good shit'] 
count_model = CountVectorizer(ngram_range=(1, 1))  # default unigram model
X = count_model.fit_transform(docs)
Xc = (X.T @ X)  # this is the co-occurrence matrix in sparse format
Xc.setdiag(0)  # sometimes you want to set a word's co-occurrence with itself to 0
print(Xc.todense())  # print out matrix in dense format
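One thing to be aware of: multiplying raw counts means a word that repeats within a document inflates the result (each entry is a sum of products of counts). If what you want is the number of documents in which two words appear together, a possible variation is to pass binary=True to CountVectorizer. The sketch below uses illustrative documents, and the variable names are my own:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents
docs = ['this this this book',
        'this cat good',
        'cat good dog']

# binary=True records presence/absence instead of raw term counts
binary_model = CountVectorizer(binary=True)
Xb = binary_model.fit_transform(docs)

# Entry (i, j) is the number of documents containing both word i and word j
Xc_docs = (Xb.T @ Xb).toarray()
np.fill_diagonal(Xc_docs, 0)  # ignore a word's co-occurrence with itself

v = binary_model.vocabulary_  # word -> row/column index
print(Xc_docs[v['this'], v['book']])  # documents containing both 'this' and 'book'
print(Xc_docs[v['cat'], v['good']])   # documents containing both 'cat' and 'good'
```

With raw counts, the 'this'/'book' entry would be 3, because 'this' occurs three times in the first document; with binary=True it is 1.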

You can also refer to the dictionary of words in count_model:

count_model.vocabulary_
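Since vocabulary_ maps each word to its column index, it can be used to read a single entry of the co-occurrence matrix by word. A small sketch, where cooccurrence is a hypothetical helper (not part of scikit-learn) and the documents are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents
docs = ['this this this book',
        'this cat good',
        'cat good dog']

count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)
Xc = X.T @ X  # word-word co-occurrence matrix (sparse)

def cooccurrence(word_a, word_b):
    """Hypothetical helper: look up the matrix entry for a pair of words."""
    i = count_model.vocabulary_[word_a]
    j = count_model.vocabulary_[word_b]
    return int(Xc[i, j])

print(sorted(count_model.vocabulary_.items(), key=lambda kv: kv[1]))
print(cooccurrence('cat', 'good'))
print(cooccurrence('this', 'book'))
```

The helper raises a KeyError for words that are not in the fitted vocabulary.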

Or, if you want to normalize by the diagonal component (as referred to in the answer in the previous post):

import scipy.sparse as sp
Xc = (X.T @ X)  # recompute so the diagonal is still intact
g = sp.diags(1. / Xc.diagonal())  # reciprocal of each word's self co-occurrence
Xc_norm = g @ Xc  # normalized co-occurrence matrix
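To make the effect of this normalization concrete: left-multiplying by the diagonal matrix divides each row i by the diagonal entry Xc[i, i], so the result has ones on the diagonal and is generally not symmetric. A runnable sketch with illustrative documents:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents
docs = ['this this this book',
        'this cat good',
        'cat good dog']

count_model = CountVectorizer()
X = count_model.fit_transform(docs)
Xc = X.T @ X  # keep the diagonal: it is needed for the normalization

# Divide row i by Xc[i, i]; the diagonal is nonzero because every
# vocabulary word occurs at least once in the fitted documents
g = sp.diags(1.0 / Xc.diagonal())
Xc_norm = (g @ Xc).toarray()

v = count_model.vocabulary_
print(Xc_norm[v['cat'], v['good']])   # row 'cat' is divided by Xc['cat', 'cat']
print(Xc_norm[v['this'], v['book']])  # row 'this' is divided by Xc['this', 'this']
```

Note that the normalization has to happen before setdiag(0) is applied, otherwise the reciprocal of the diagonal is undefined.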


2016-06-14 22:12:51 titipata
