tm DocumentTermMatrix gives unexpected results for a given corpus
Maybe I have misunderstood how tm::DocumentTermMatrix works. I have a corpus which, after preprocessing, looks like this:
head(Description.text, 3)
[1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"
[2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"
[3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
I pass it through:
Description.text.features <- DocumentTermMatrix(
  Corpus(VectorSource(Description.text)),
  control = list(
    bounds = list(local = c(3, Inf)),
    tokenize = 'scan'
  )
)
When I inspect the first row of the DTM, I get this:
inspect(Description.text.features[1,])
<<DocumentTermMatrix (documents: 1, terms: 887)>>
Non-/sparse entries: 0/887
Sparsity : 100%
Maximal term length: 15
Weighting : term frequency (tf)
Sample :
Terms
Docs banc camill mar martin ospedal presid san sanitar torin vittor
1 0 0 0 0 0 0 0 0 0 0
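These terms do not correspond to the first document in the corpus Description.text (e.g. banc and camill are not in the first document, while martin and presid, which are, show a count of zero).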
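As a sanity check on that claim, here is a minimal base-R sketch (no tm involved) that tests whether a few of the sampled terms occur in the first document. The `doc1` string is copied from the `head()` output above; the variable names are only illustrative.

```r
# First document, copied from the head() output above
doc1 <- "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"

# Split on whitespace to get the individual tokens
tokens <- strsplit(doc1, "\\s+")[[1]]

# Which of the terms shown by inspect() actually occur in the document?
sampled <- c("banc", "camill", "martin", "presid")
in_doc1 <- setNames(sampled %in% tokens, sampled)
print(in_doc1)
```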
Furthermore, if I run:
Description.text.features[1,] %>% as.matrix() %>% sum
I get zero, meaning that the first document contains no terms with a frequency greater than zero!
What is going on here?
Thanks
UPDATE
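I wrote my own "corpus to DTM" function, and it does give very different results. Apart from the document-term weights being very different from those of tm::DocumentTermMatrix (mine are what I would expect given the corpus), my function also returns many more terms than the tm function (~3000 vs. ~800 for tm).
Here is my function: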
corpus.to.DTM <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) {
  library(dplyr)
  library(magrittr)
  library(tm)
  library(parallel)

  # Vocabulary: terms that occur in at least min.doc.freq documents
  # and are at least minlength characters long
  lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>%
    unlist %>%
    table %>%
    data.frame %>%
    set_colnames(c('term', 'freq')) %>%
    mutate(term = as.character(term), lengths = nchar(term)) %>%  # base nchar(), so stringr is not needed
    filter(freq >= min.doc.freq & lengths >= minlength) %>%
    use_series(term)

  # One row of raw term counts per document, columns ordered as lvls
  dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>%
    do.call(what = 'rbind') %>%
    set_colnames(lvls)

  # Apply the weighting passed as weight.fun and return a plain data frame
  as.DocumentTermMatrix(dtm, weighting = weight.fun) %>%
    as.matrix() %>%
    as.data.frame()
}
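The gap between the two results may come down to which frequency is being thresholded. The corpus.to.DTM function above filters on document frequency (`min.doc.freq`), whereas, as far as I can tell from the tm documentation, `bounds$local` is applied to the count of a term within a single document and `bounds$global` to the number of documents containing it. The base-R sketch below does not call tm; it only imitates the two kinds of threshold on the three sample documents, and all variable names are illustrative.

```r
# The three sample documents from the head() output above
docs <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
)
tokens <- strsplit(docs, "\\s+")

# Vocabulary across the sample
vocab <- sort(unique(unlist(tokens)))

# Raw count matrix: rows = documents, columns = terms
counts <- t(sapply(tokens, function(tk) as.vector(table(factor(tk, levels = vocab)))))
colnames(counts) <- vocab

# Document-frequency threshold: keep terms found in at least 3 documents
doc_freq <- colSums(counts > 0)
keep_by_docs <- names(doc_freq)[doc_freq >= 3]

# Within-document threshold: drop counts below 3 inside each document
local_filtered <- counts
local_filtered[local_filtered < 3] <- 0

print(keep_by_docs)
print(rowSums(local_filtered))
```

On these three documents no term reaches a within-document count of 3, so every row sums to zero under the within-document threshold, while `martin` and `ospedal` survive the document-frequency threshold. If that reading of the tm documentation is right, `bounds = list(global = c(3, Inf))` would be the setting comparable to `min.doc.freq = 3`.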
Thanks for the suggestion! I'll take a look at that package! But my question is specifically about what is going wrong with tm! – Bakaburg