2017-07-28 107 views

TM DocumentTermMatrix gives unexpected results given the corpus

Maybe I am misunderstanding how tm::DocumentTermMatrix works. I have a corpus that looks like this after preprocessing:

head(Description.text, 3) 
[1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"      
[2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"  
[3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin" 

I build the document-term matrix with:

Description.text.features <- DocumentTermMatrix(Corpus(VectorSource(Description.text)), list(
    bounds = list(local = c(3, Inf)), 
    tokenize = 'scan' 
)) 

When I inspect the first row of the DTM, I get this:

inspect(Description.text.features[1,]) 
<<DocumentTermMatrix (documents: 1, terms: 887)>> 
Non-/sparse entries: 0/887 
Sparsity   : 100% 
Maximal term length: 15 
Weighting   : term frequency (tf) 
Sample    : 
    Terms 
Docs banc camill mar martin ospedal presid san sanitar torin vittor 
    1 0  0 0  0  0  0 0  0  0  0 

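These terms do not correspond to the first document in the corpus Description.text: for example, banc and camill are not in the first document at all, while martin and presid are in it but show a count of zero.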

Furthermore, if I run:

Description.text.features[1,] %>% as.matrix() %>% sum 

I get zero, which means that the first document has no terms with a frequency greater than zero!

What is going on here?

Thanks

UPDATE

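I wrote my own "corpus to DTM" function, and it gives very different results. Besides the document-term weights being very different from those of tm::DocumentTermMatrix (mine are what I would expect given the corpus), my function returns many more terms than the tm function (about 3000 versus about 800 for tm).

Here is my function: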

corpus.to.DTM <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) { 
    library(dplyr) 
    library(magrittr) 
    library(stringr)   # provides str_length() 
    library(tm) 
    library(parallel) 

    # Keep terms found in at least min.doc.freq documents with at least minlength characters 
    lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>% 
     unlist %>% 
     table %>% 
     data.frame %>% 
     set_colnames(c('term', 'freq')) %>% 
     mutate(lengths = str_length(term)) %>% 
     filter(freq >= min.doc.freq & lengths >= minlength) %>% 
     use_series(term) 

    # Count the kept terms in every document: one row per document, one column per term 
    dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>% 
     do.call(what = 'rbind') %>% 
     set_colnames(lvls) 

    # Apply the weighting passed in weight.fun (previously hard-coded to weightTfIdf) 
    as.DocumentTermMatrix(dtm, weighting = weight.fun) %>% 
     as.matrix() %>% 
     as.data.frame() 
} 

Answer


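Here is a workaround that uses quanteda as an alternative to tm. You may even find the relative simplicity of the former, plus its speed and features, enough to use it for the rest of your analysis!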

description.text <- 
    c("azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram", 
    "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur", 
    "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin") 

require(quanteda) 
require(magrittr) 

qdfm <- dfm(description.text) 
head(qdfm, nfeat = 10) 
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse). 
# (showing first 3 documents and first 10 features) 
#  features 
# docs azi sanitar local to1 presid osp martin ospedalier tofan torin 
# text1 1  1  1 1  2 1  2   1  1  1 
# text2 0  0  0 0  0 0  2   0  1  2 
# text3 0  0  0 0  0 0  2   0  0  0 

qdfm2 <- qdfm %>% dfm_trim(min_count = 3, min_docfreq = 3) 
qdfm2 
# Document-feature matrix of: 3 documents, 2 features (0% sparse). 
# (showing first 3 documents and first 2 features) 
#  features 
# docs martin ospedal 
# text1  2  1 
# text2  2  2 
# text3  2  2 

To convert back to tm:

convert(qdfm2, to = "tm") 
# <<DocumentTermMatrix (documents: 3, terms: 2)>> 
# Non-/sparse entries: 6/0 
# Sparsity   : 0% 
# Maximal term length: 7 
# Weighting   : term frequency (tf) 

In your example, you use tf-idf weighting. That is also easy in quanteda:

dfm_weight(qdfm, "tfidf") %>% head 
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse). 
# (showing first 3 documents and first 6 features) 
#   features 
# docs   azi sanitar  local  to1 presid  osp 
# text1 0.4771213 0.4771213 0.4771213 0.4771213 0.9542425 0.4771213 
# text2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
# text3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
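
The calls above follow the quanteda API as it was when this answer was written. As a hedged sketch for more recent releases: `dfm()` is normally built from `tokens()`, `min_count` was renamed `min_termfreq`, and tf-idf weighting has its own function, `dfm_tfidf()`. The toy documents and the `toy.*` names below are made up for illustration and are not the asker's data.

```r
library(quanteda)

# Hypothetical toy documents (not the asker's corpus)
toy.text <- c(
    "ospedal martin torin presid",
    "ospedal ospedal ospedal martin croll",
    "ospedal martin torin croll croll croll"
)

# Recent quanteda versions expect tokens as input to dfm()
toy.dfm <- dfm(tokens(toy.text))

# min_termfreq (formerly min_count): total count across all documents
# min_docfreq: number of documents the feature appears in
toy.trimmed <- dfm_trim(toy.dfm, min_termfreq = 3, min_docfreq = 3)
featnames(toy.trimmed)

# tf-idf weighting (formerly dfm_weight(x, "tfidf"))
toy.tfidf <- dfm_tfidf(toy.dfm)
```

Only `ospedal` and `martin` survive the trim here, because `croll` reaches a total count of 3 but appears in just two documents.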

Thanks for the suggestion! I will take a look at this package! But my question was specifically about what goes wrong with tm! – Bakaburg
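
On the tm side, one plausible reading (based on the tm documentation for the `bounds` control option, not confirmed by the asker) is that `bounds = list(local = c(3, Inf))` filters on the count of a term within each single document, whereas filtering on the number of documents that contain a term uses the `global` tag. A document in which no term occurs three times then ends up as an all-zero row, and `inspect()` prints a sample of terms from the whole matrix rather than the terms of that row. The sketch below uses made-up toy documents and hypothetical `dtm.*` names, and builds a `VCorpus` explicitly.

```r
library(tm)

# Hypothetical toy documents (not the asker's corpus)
toy.text <- c(
    "ospedal martin torin presid",
    "ospedal ospedal ospedal martin croll",
    "ospedal martin torin croll croll croll"
)
toy.corpus <- VCorpus(VectorSource(toy.text))

# local: keep a term in a document only if it occurs >= 3 times in that document
dtm.local <- DocumentTermMatrix(toy.corpus,
    control = list(bounds = list(local = c(3, Inf))))

# global: keep a term only if it appears in >= 3 documents
dtm.global <- DocumentTermMatrix(toy.corpus,
    control = list(bounds = list(global = c(3, Inf))))

sum(as.matrix(dtm.local[1, ]))   # no term occurs 3 times in document 1
Terms(dtm.global)                # terms present in all three documents
sum(as.matrix(dtm.global[1, ]))
```

With the local bound, the first toy document has an all-zero row, which mirrors the behaviour in the question. With the global bound, it keeps one count each for `martin` and `ospedal`. This would also be consistent with the custom function returning more terms, since it filters by document frequency.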
