I am looking for a module in sklearn that lets you derive a word-word co-occurrence matrix. I can get the document-term matrix, but I don't know how to go about obtaining a matrix of co-occurring words.
8
Answers
1
You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer to count pairs of adjacent words. Code example:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means you only want pairs of 2 adjacent words (bigrams)
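If you want to state explicitly which word co-occurrences to count, use the vocabulary parameter, i.e.: vocabulary = {'awesome unicorns':0, 'batman forever':1}
Here is self-explanatory, ready-to-use code with predefined word co-occurrences. In this case we only track the co-occurrences of awesome unicorns and batman forever: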
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
# A fixed vocabulary restricts counting to the listed word pairs
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
# Sum over documents to get the total count of each pair
sum_occ = np.sum(co_occurrences.todense(), axis=0)
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out().tolist(), np.array(sum_occ)[0].tolist())))
The final output is ('awesome unicorns', 1), ('batman forever', 2), which matches exactly the samples data we provided.
11
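Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.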
from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
'this cat good',
'cat good shit']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs)
Xc = (X.T @ X) # this is the co-occurrence matrix in sparse csr format
Xc.setdiag(0) # sometimes you want to set the co-occurrence of a word with itself to 0
print(Xc.todense()) # print out matrix in dense format
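To see what the product actually counts, the small worked example below may help. It is a sketch using NumPy only: the count matrix is written out by hand for the three documents above, assuming the alphabetical column order that CountVectorizer produces, and the names Xc_counts, X_bin and Xc_docs are made up here. It shows that entry (i, j) is a sum of products of counts, so a repeated word weighs more, and that binarizing first yields the number of documents containing both words.

```python
import numpy as np

# Document-term counts for the three example documents, written out by hand.
# Columns are assumed to follow the alphabetical vocabulary order.
vocab = ['book', 'cat', 'good', 'shit', 'this']
X = np.array([
    [1, 0, 0, 0, 3],  # 'this this this book'
    [0, 1, 1, 0, 1],  # 'this cat good'
    [0, 1, 1, 1, 0],  # 'cat good shit'
])

# Entry (i, j) sums count_i * count_j over documents
Xc_counts = X.T @ X

# Binarize first: entry (i, j) becomes the number of documents containing both words
X_bin = (X > 0).astype(int)
Xc_docs = X_bin.T @ X_bin

i, j = vocab.index('book'), vocab.index('this')
print(Xc_counts[i, j])  # 3, because 'this' occurs three times next to one 'book'
print(Xc_docs[i, j])    # 1, because the two words share exactly one document
```

With CountVectorizer itself, passing binary=True should have the same effect as the binarization step.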
You can also look up the word-to-index dictionary of count_model:
count_model.vocabulary_
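As an illustration of how that dictionary can be used, the sketch below looks up the co-occurrence count of two given words. The helper cooccurrence is hypothetical and not part of scikit-learn. The matrix is converted to a dense array here only to keep the indexing simple, which is reasonable for a small vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)
Xc = (X.T @ X).toarray()  # dense copy, fine for a small vocabulary

def cooccurrence(word_a, word_b):
    """Hypothetical helper: look up one entry of Xc by word instead of by index."""
    vocab = count_model.vocabulary_  # maps each word to its column index
    return int(Xc[vocab[word_a], vocab[word_b]])

print(cooccurrence('cat', 'good'))  # 2
```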
Or, if you want to normalize by the diagonal component (as mentioned in an answer in the previous post):
import scipy.sparse as sp
Xc = (X.T @ X)
g = sp.diags(1./Xc.diagonal())
Xc_norm = g @ Xc # normalized co-occurrence matrix
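The effect of this normalization can be checked on a tiny dense matrix. The sketch below uses NumPy only, and the values in Xc are hypothetical counts chosen for illustration. Multiplying by the inverse diagonal from the left divides each row i by Xc[i, i], so the diagonal becomes 1 and the result is in general no longer symmetric.

```python
import numpy as np

# Hypothetical co-occurrence counts for three words (diagonal = self co-occurrence)
Xc = np.array([[2, 2, 1],
               [2, 2, 1],
               [1, 1, 1]], dtype=float)

# Dense equivalent of g @ Xc with g = diag(1 / diagonal): divide row i by Xc[i, i]
Xc_norm = Xc / Xc.diagonal()[:, np.newaxis]

print(Xc_norm[0, 2])  # 0.5 (1 divided by the first word's diagonal entry, 2)
print(Xc_norm[2, 0])  # 1.0 (1 divided by the third word's diagonal entry, 1)
```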
Can you add some data and an attempt at solving the problem? – Cleb