How to efficiently calculate PPMI on a sparse matrix in R?

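I would have thought that, between the R packages text2vec, tm, quanteda, svs, qlcMatrix and wordspace, there would be a function to calculate PPMI (positive pointwise mutual information) between terms and contexts, based on a term-term (context) co-occurrence matrix - but apparently not, so I went ahead and wrote one myself. The problem is that it is slow as molasses, probably because I'm not very good with sparse matrices - and my tcms are around 10k * 20k, so they really do need to be sparse. How can PPMI be calculated efficiently on a sparse matrix in R?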

From what I understand, PMI = log(p(word, context)/(p(word)*p(context))), so I figured that:

           count(word_context_co-occurrence)/N
PMI = log( ------------------------------------- )
           count(word)/N * count(context)/N

Where N is the sum of all co-occurrences in the co-occurrence matrix. And PPMI simply forces all values < 0 to be 0 (this is correct so far, right?).
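
As a sanity check of the formula, here is a minimal sketch on a dense toy matrix. The counts and the names `m`, `expected`, `pmi` and `ppmi` are made up for illustration; the point is only to show the arithmetic, not a way to handle large matrices.

```r
# Toy 2x2 co-occurrence counts (hypothetical data): rows are words, columns are contexts
m <- matrix(c(4, 0,
              1, 5), nrow = 2, byrow = TRUE)

N <- sum(m)                                        # total number of co-occurrences
expected <- outer(rowSums(m) / N, colSums(m) / N)  # p(word) * p(context) for every cell
pmi <- log((m / N) / expected)                     # zero counts give log(0) = -Inf
ppmi <- pmax(pmi, 0)                               # PPMI: negatives (and -Inf) become 0
round(ppmi, 2)
```

In this example only the two diagonal cells end up with a positive weight, log(2) and log(5/3); the zero count and the rarer-than-expected pair are both clamped to 0.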

With that in mind, here is an attempt at an implementation:

library(Matrix) 
set.seed(1) 
pmat = matrix(sample(c(0,0,0,0,0,0,1,10),5*10,T), 5,10, byrow=T) # tiny example matrix; 
# rows are words, columns are contexts (words the row-words co-occur with, in a certain window in the text) 
pmat = Matrix(pmat, sparse=T) # make it sparse 

# calculate some things beforehand to make it faster 
N = sum(pmat) 
contextp = Matrix::colSums(pmat)/N # probabilities of contexts 
wordp = Matrix::rowSums(pmat)/N # probabilities of terms 

# here goes nothing... 
pmat2 = pmat 
for(r in 1:nrow(pmat)){ # go term by term, calculate PPMI association with each of its contexts 
    not0 = which(pmat[r, ] > 0) # no need to consider 0 values (no co-occurrence) 
    tmp = log((pmat[r,not0]/N)/(wordp[r] * contextp[not0])) # PMI 
    tmp = ifelse(tmp < 0, 0, tmp) # PPMI 
    pmat2[r, not0] = tmp # <-- THIS here is the slow part, replacing the old frequency values with the new PPMI weighted ones. 
} 
# take a look: 
round(pmat2,2)

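What appears to be slow is not the calculation itself but putting the newly calculated values into the sparse matrix. In this tiny example it's not bad, but with thousands of rows by thousands of columns even a single iteration of this loop takes forever; constructing a new matrix with rBind seems like a bad idea too.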

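What would be a more efficient way to replace the old values in such a sparse matrix with the new PPMI-weighted values? Suggestions to change this code, or to use an existing function from some package that I somehow missed, are both welcome.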


2017-04-11 user3554004

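Take a look at the dev version of text2vec. Here is how I calculate PMI for phrase (collocation) extraction - https://github.com/dselivanov/text2vec/blob/master/R/collocations.R#L57-L76. Regarding your question - in general, try to avoid element-wise access in sparse matrices; it is very inefficient.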
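
A minimal sketch of what avoiding element-wise access could look like, assuming the matrix is stored in column-compressed form (a `dgCMatrix`, which is what `Matrix(..., sparse = TRUE)` returns for a general numeric matrix). The names `tcm`, `row_idx`, `col_idx` and `pmi_x` are illustrative, and this is not the text2vec code linked above. It works on the stored non-zero values in one vectorised pass and builds a fresh matrix from the entries with positive PMI.

```r
library(Matrix)

# Hypothetical toy term-context matrix in column-compressed (dgCMatrix) form
set.seed(1)
tcm <- Matrix(matrix(sample(c(0, 0, 0, 1, 10), 5 * 10, replace = TRUE), 5, 10),
              sparse = TRUE)

N    <- sum(tcm)
rowp <- Matrix::rowSums(tcm) / N   # p(word)
colp <- Matrix::colSums(tcm) / N   # p(context)

# Row and column index of every stored entry, in the same order as tcm@x
row_idx <- tcm@i + 1L
col_idx <- rep(seq_len(ncol(tcm)), times = diff(tcm@p))

# PMI for all stored entries at once; no tcm[r, c] indexing anywhere
pmi_x <- log((tcm@x / N) / (rowp[row_idx] * colp[col_idx]))

# Keep only positive PMI values and build the PPMI matrix from the triplets
keep <- pmi_x > 0
ppmi <- sparseMatrix(i = row_idx[keep], j = col_idx[keep], x = pmi_x[keep],
                     dims = dim(tcm))
```

Because nothing is ever assigned into the sparse matrix cell by cell, the cost should scale with the number of stored entries rather than with the full 10k * 20k grid.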

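I figured it out in the meantime; this works reasonably fast. I'll leave it here in case somebody else ends up facing the same problem. It also seems to be very similar to the approach linked in the comment on the question (thanks!).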

# this is for a column-oriented sparse matrix; transpose if necessary
tcmrs = Matrix::rowSums(pmat)
tcmcs = Matrix::colSums(pmat)
N = sum(tcmrs)
colp = tcmcs/N
rowp = tcmrs/N
pp = pmat@p+1 # column pointers, shifted to 1-based indexing
ip = pmat@i+1 # row indices of the stored values, shifted to 1-based indexing
tmpx = rep(0,length(pmat@x)) # new values go here, just a numeric vector
# iterate through sparse matrix, one column at a time:
for(i in 1:(length(pmat@p)-1)){
    if(pp[i+1] == pp[i]) next # empty column: nothing stored, and pp[i]:(pp[i+1]-1) would count backwards
    ind = pp[i]:(pp[i+1]-1) # positions of column i's values in pmat@x
    not0 = ip[ind] # rows that have a non-zero value in column i
    icol = pmat@x[ind]
    tmp = log((icol/N)/(rowp[not0] * colp[i])) # PMI
    tmpx[ind] = tmp
}
pmat@x = tmpx # swap in all new values at once
# to convert to PPMI, replace <0 values with 0 and do a Matrix::drop0() on the object.
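
The final step mentioned in the last comment could be sketched roughly as follows. The small matrix and its values are invented for illustration and simply stand in for a matrix that already holds PMI values; `pmi` and `ppmi` are illustrative names.

```r
library(Matrix)

# Hypothetical 3x3 matrix that already holds PMI values, some of them negative
pmi <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 2, 3),
                    x = c(0.7, -1.2, 0.4, -0.3), dims = c(3, 3))

pmi@x[pmi@x < 0] <- 0   # clamp in the value slot; the zeros are still stored explicitly
length(pmi@x)           # 4 stored entries, two of them explicit zeros
ppmi <- drop0(pmi)      # drop the explicit zeros so the matrix stays truly sparse
length(ppmi@x)          # 2 stored entries remain
```

Clamping through the `@x` slot touches only the stored values, and `drop0()` then removes the explicit zeros so that later operations do not carry them around.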


2017-04-15 16:38:54 user3554004
