2017-02-09

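Keep only specific terms in a document-term matrix in R

Question: How can I keep only the bigram "no wonderful" in the document-term matrix, or more generally only a list of bigrams (Terms) that I want to keep?

I want to apply this to a very large document-term matrix. I tried converting the term matrix to an ordinary matrix, but the resulting vector size exceeds 1000 Gb.

Code: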

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

# map the data frame columns to document content and id
myReader <- readTabular(mapping = list(content = "text", id = "id"))

# create a VCorpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

# bigram tokenizer (n-grams with min = max = 2)
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# create a term-document matrix using Tokenizer
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                Docs
Terms            10 11 12 13
  at the          0  0  0  1
  but there       0  0  1  0
  cases such      0  1  0  0
  even at         0  0  0  1
  in many         0  1  0  0
  many cases      0  1  0  0
  no wonderful    1  0  0  0
  not even        0  0  0  1
  other and       0  0  1  0
  so that         0  1  0  0
  still other     0  0  1  0
  such a          0  1  0  0
  that ever       1  0  0  0
  that in         0  1  0  0
  the rationale   0  0  0  1
  then that       1  0  0  0
  there were      0  0  1  0
  were still      0  0  1  0
  wonderful then  1  0  0  0

Answer

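I had assumed this would be more complicated because the object is a DTM, but the matrix can be subset by term name directly, without converting it to an ordinary matrix.

Problem solved: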

d_sel <- dtm[c('no wonderful', 'there were'), ]
inspect(d_sel)

              Docs
Terms          10 11 12 13
  no wonderful  1  0  0  0
  there were    0  0  1  0
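
For a longer list of bigrams, some of which may not occur in the corpus, the same idea can be sketched as follows. This is only a minimal illustration under a few assumptions: the two sample sentences, the names `bigram_tokenizer`, `wanted`, `keep` and `tdm_sel`, and the entry "not present" are made up for the example, and the tokenizer uses `ngrams()` from the NLP package (attached together with tm) instead of RWeka so that no Java setup is needed. Intersecting the wanted list with `Terms()` first is a precaution, on the assumption that indexing by a term name that is not in the matrix would fail.

```r
library(tm)  # attaching tm also attaches NLP, which provides ngrams() and words()

# Two made-up, already lower-cased sentences without punctuation (assumption)
texts <- c("no wonderful then that ever",
           "but there were still other and")
corpus <- VCorpus(VectorSource(texts))

# Bigram tokenizer built on NLP::ngrams (stand-in for the RWeka tokenizer)
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Hypothetical list of bigrams to keep; "not present" does not occur in the corpus
wanted <- c("no wonderful", "there were", "not present")

# Keep only the wanted terms that really exist as rows of the matrix
keep <- intersect(wanted, Terms(tdm))

# Subset by row name; the result stays a sparse TermDocumentMatrix
tdm_sel <- tdm[keep, ]
inspect(tdm_sel)
```

Because the subset is taken on the sparse object, the full dense matrix is never created; only the small selection would need `as.matrix()` if a dense copy is wanted afterwards. If even building the full matrix is too expensive, the `dictionary` control option of `TermDocumentMatrix` may be worth checking in the tm documentation, since it is meant to restrict tabulation to a given character vector of terms.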