Splitting ngrams in a (sparse) document-feature matrix

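This is a follow-up to an earlier question of mine. There, I asked whether it is possible to split ngram features in a document-feature matrix (the dfm class from the quanteda package) so that, for example, each bigram results in two separate unigrams.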

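For better understanding: the ngrams in my dfm come from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German, but not in English ("emission reduction").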

library(quanteda) 

eg.txt <- c('increase in_the great plenary', 
      'great plenary emission_reduction', 
      'increase in_the emission_reduction emission_increase') 
eg.corp <- corpus(eg.txt) 
eg.dfm <- dfm(eg.corp)

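There is a good answer for this example, and it works for relatively small matrices like the one above. However, as soon as the matrix gets bigger, I keep running into the following memory error.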

> #turn the dfm into a matrix 
> DF <- as.data.frame(eg.dfm) 
Error in asMethod(object) : 
    Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105

So, is there a more memory-efficient way to solve this ngram problem, or to deal with large (sparse) matrices/data frames? Thank you in advance!


2017-06-14 uyanik

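The problem here is that when you call as.data.frame(), you are converting the sparse (dfm) matrix into a dense object. Since a typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.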
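To get a feel for the scale before converting, the dense footprint can be estimated from the dimensions alone. The sketch below uses the Matrix package, whose sparse classes a dfm extends, with a randomly generated matrix. The dimensions and the 10% density are illustrative assumptions, not measurements from the data in the question.

```r
library(Matrix)

set.seed(42)
# Hypothetical stand-in for a dfm: 1,000 documents x 2,000 features, 90% zeros
m <- rsparsematrix(nrow = 1000, ncol = 2000, density = 0.1)

# Memory actually used by the sparse representation
sparse_bytes <- as.numeric(object.size(m))

# Memory a dense numeric copy would need: 8 bytes per cell, zeros included
dense_bytes <- prod(dim(m)) * 8

# How many times larger the dense copy would be
dense_bytes / sparse_bytes
```

The same `prod(dim(x)) * 8` estimate can be applied to a real dfm to judge roughly whether a dense conversion could fit in memory at all.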

Note that this is a better solution than the one proposed in the linked question, and it should also work efficiently for your larger object.

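Here is a function that does this. It lets you set the concatenator character and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.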

# function to split and duplicate counts in features containing 
# the concatenator character 
dfm_splitgrams <- function(x, concatenator = "_") { 
    # separate the unigrams 
    x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex") 

    # separate the ngrams 
    x_ngrams <- dfm_select(x, concatenator, valuetype = "regex") 
    # split into components 
    split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator) 
    # get a repeated index for the ngram feature names 
    index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams)) 
    # subset the ngram matrix using the (repeated) ngram feature names 
    x_split_ngrams <- x_ngrams[, index_split_ngrams] 
    # assign the ngram dfm the feature names of the split ngrams 
    colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE) 

    # return the column concatenation of unigrams and split ngrams 
    suppressWarnings(cbind(x_unigrams, x_split_ngrams)) 
}
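The core trick in the function is indexing the columns with a repeated index, so each ngram column is duplicated once per component without leaving the sparse format. A minimal sketch of that step on a plain Matrix object follows; the toy matrix `m` and the variable names are hypothetical, and base `strsplit()` stands in for the stringi call.

```r
library(Matrix)

# Toy sparse matrix standing in for the ngram part of the example dfm
m <- sparseMatrix(
  i = c(1, 3, 2, 3, 3),
  j = c(1, 1, 2, 2, 3),
  x = 1,
  dims = c(3, 3),
  dimnames = list(paste0("text", 1:3),
                  c("in_the", "emission_reduction", "emission_increase"))
)

# Split each feature name into its components
parts <- strsplit(colnames(m), "_", fixed = TRUE)

# Repeat each column position once per component of its ngram
idx <- rep(seq_len(ncol(m)), lengths(parts))

# Duplicate the columns; the result is still a sparse matrix
m_split <- m[, idx]
colnames(m_split) <- unlist(parts, use.names = FALSE)

m_split
```

Because only the stored non-zero entries are copied, the cost grows with the number of non-zero counts rather than with the full documents-by-features grid.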

So:

dfm_splitgrams(eg.dfm) 
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1

Here, splitting the ngrams results in new "unigrams" that share a feature name with existing ones. You can (re)combine them efficiently with dfm_compress():

dfm_compress(dfm_splitgrams(eg.dfm)) 
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1
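Conceptually, combining identically named columns amounts to summing them, which can be done without densifying by multiplying with a sparse indicator matrix. The sketch below illustrates that idea on a plain Matrix object rebuilt from the output above. It is an assumption-laden illustration, not a description of how dfm_compress() is implemented, and the names `feats`, `map` and `m_compressed` are hypothetical.

```r
library(Matrix)

feats <- c("increase", "great", "plenary", "in", "the",
           "emission", "reduction", "emission", "increase")

# The 3 x 9 split matrix from above, stored sparsely
m <- Matrix(
  matrix(c(1, 1, 1, 1, 1, 0, 0, 0, 0,
           0, 1, 1, 0, 0, 1, 1, 0, 0,
           1, 0, 0, 1, 1, 1, 1, 1, 1),
         nrow = 3, byrow = TRUE,
         dimnames = list(paste0("text", 1:3), feats)),
  sparse = TRUE
)

# Sparse indicator matrix: row k has a single 1 in the column of its unique name
uniq <- unique(feats)
map <- sparseMatrix(
  i = seq_along(feats),
  j = match(feats, uniq),
  x = 1,
  dims = c(length(feats), length(uniq)),
  dimnames = list(feats, uniq)
)

# Sparse product: columns with the same name are added together
m_compressed <- m %*% map
m_compressed
```

In practice dfm_compress() is the better choice for a dfm, since it returns a dfm and keeps its metadata; the sketch only shows why the operation does not require a dense intermediate.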


2017-06-14 13:17:06

This is just great! Your function runs absolutely smoothly, fast, and without any errors. Thank you very much! – uyanik
