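I have a corpus of more than 15,000 text documents, and the removeSparseTerms function is not working on it. How can I reduce the sparsity of the corpus's document-term matrix in R?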
dtm
<<DocumentTermMatrix (documents: 15095, terms: 12811)>>
Non-/sparse entries: 140286/193241759
Sparsity : 100%
Maximal term length: 37
Weighting : term frequency (tf)
dtms <- removeSparseTerms(dtm, 0.1)
dtms
<<DocumentTermMatrix (documents: 15095, terms: 0)>>
Non-/sparse entries: 0/0
Sparsity : 100%
Maximal term length: 0
Weighting : term frequency (tf)
I also tried this, but it did not help either:
library(slam)  # provides col_sums()
colTotals <- col_sums(dtm)
dtm2 <- dtm[, which(colTotals > 20)]
dtm2
<<DocumentTermMatrix (documents: 15095, terms: 1387)>>
Non-/sparse entries: 100867/20835898
Sparsity : 100%
Maximal term length: 26
Weighting : term frequency (tf)
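Is there anything else I can do to reduce the sparsity? I want to be able to open the frequency table in Excel, but right now it needs too much memory, so I cannot open it (which is why I want to reduce the sparsity).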
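For reference, here is a minimal sketch of two things that might be worth trying, under stated assumptions. First, the second argument of removeSparseTerms is the maximal sparsity a term may have, so 0.1 keeps only terms that occur in more than 90% of the documents, which would explain the empty result above; a value close to 1 is the more usual choice. Second, the matrix can be exported in long form (one row per non-zero cell) rather than as a full documents-by-terms table. The toy corpus, the 0.6 threshold, and the output file names are illustrative assumptions, not part of the original data.

```r
library(tm)    # DocumentTermMatrix, removeSparseTerms, Docs, Terms
library(slam)  # col_sums

# Toy corpus standing in for the real documents (hypothetical data).
docs <- c("apple banana", "apple cherry", "apple banana date", "apple")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# 'sparse' is the maximal allowed sparsity per term.
# 0.1 keeps only terms present in more than 90% of the documents.
dtm_strict <- removeSparseTerms(dtm, 0.1)
# A larger value is more permissive; on a big corpus something like 0.99
# may be a reasonable starting point (0.6 is used here for the toy data).
dtm_loose <- removeSparseTerms(dtm, 0.6)

# Long form: one row per non-zero cell, using the triplet fields i, j, v
# that the document-term matrix already stores internally.
long <- data.frame(
  doc   = Docs(dtm_loose)[dtm_loose$i],
  term  = Terms(dtm_loose)[dtm_loose$j],
  count = dtm_loose$v
)
write.csv(long, "dtm_long.csv", row.names = FALSE)

# Overall term totals, often enough for a frequency table.
totals <- sort(col_sums(dtm_loose), decreasing = TRUE)
write.csv(data.frame(term = names(totals), count = totals),
          "term_totals.csv", row.names = FALSE)
```

The long form has as many rows as there are non-sparse entries, so it avoids materialising the zero cells that a dense table would contain. Whether it is small enough for Excel depends on how many non-zero entries remain after filtering.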