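Same value but different results? About removeSparseTerms (R)

First of all, here is sample data to reproduce the problem, which I will try to explain below: https://drive.google.com/file/d/0B4RCdYlVF8otUll6V2x0cDJORGc/view?usp=sharing
The problem is that I get different results from removeSparseTerms even though I pass it the same value. It seems to defy logic, or at least my logic. I have this function: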
generateTDM <- function (Room_name, dest.train, RST){
  s.dir <- sprintf("%s/%s", dest.train, Room_name)
  # Create a corpus from the already cleaned txt files.
  s.cor <- Corpus(DirSource(directory = s.dir, pattern = "txt", encoding = "UTF-8"))
  # Build a term-document matrix from the corpus, considering unigrams, bigrams and trigrams.
  s.tdm <- TermDocumentMatrix(s.cor, control = list(bounds = list(local = c(2, Inf)), tokenize = TrigramTokenizer))
  # Keep the terms that appear in more than a (1 - RST) fraction of the files; remove the rest.
  s.tdm <- removeSparseTerms(s.tdm, RST)
}
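To make the role of RST concrete, here is a minimal base-R sketch of the filtering rule. It assumes the rule quoted in the reproducible example further down (the line marked `# from tm::removeSparseTerms()`); the term names and document counts are hypothetical toy data, not taken from the attached files.

```r
# Toy data (hypothetical): number of documents in which each term occurs.
doc_counts <- c(alpha = 50, beta = 30, gamma = 26, delta = 25, epsilon = 5)
n_docs <- 50   # total number of documents in the toy corpus
sparse <- 0.5  # plays the role of RST[p]

# A term is kept only if it occurs in MORE than n_docs * (1 - sparse) documents.
threshold <- n_docs * (1 - sparse)
kept <- names(doc_counts)[doc_counts > threshold]
print(kept)
```

Because the comparison is strict, a term that sits exactly on the threshold (here `delta`) is dropped, so terms on the boundary are the ones most sensitive to the exact value of the sparsity argument.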
Then, when I call this function like this:
tdm.train <- lapply(Room_name, generateTDM, dest.train, RST[p])
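I get different output depending on the other elements of the RST vector that holds the value. That is, even though the value is the same, I get different results.
For example:
Case 1:
RST <- seq(0.45, 0.6, 0.05)
p <- 4
This gives RST = (0.45, 0.5, 0.55, 0.6), so RST[p] is 0.6.
The result in this case: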
> tdm.train
[[1]]
<<TermDocumentMatrix (terms: 84, documents: 51)>>
Non-/sparse entries: 2451/1833
Sparsity : 43%
Maximal term length: 10
Weighting : term frequency (tf)
[[2]]
<<TermDocumentMatrix (terms: 82, documents: 52)>>
Non-/sparse entries: 2409/1855
Sparsity : 44%
Maximal term length: 11
Weighting : term frequency (tf)
[[3]]
<<TermDocumentMatrix (terms: 68, documents: 51)>>
Non-/sparse entries: 1926/1542
Sparsity : 44%
Maximal term length: 13
Weighting : term frequency (tf)
[[4]]
<<TermDocumentMatrix (terms: 36, documents: 48)>>
Non-/sparse entries: 985/743
Sparsity : 43%
Maximal term length: 10
Weighting : term frequency (tf)
[[5]]
<<TermDocumentMatrix (terms: 48, documents: 50)>>
Non-/sparse entries: 1295/1105
Sparsity : 46%
Maximal term length: 10
Weighting : term frequency (tf)
[[6]]
<<TermDocumentMatrix (terms: 27, documents: 50)>>
Non-/sparse entries: 756/594
Sparsity : 44%
Maximal term length: 8
Weighting : term frequency (tf)
Case 2:
RST <- seq(0.45, 0.8, 0.05)
p <- 4
Now I have RST = (0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8), ergo RST[p] is the same as before (0.6).
So why do I get different results? I cannot understand it.
> tdm.train
[[1]]
<<TermDocumentMatrix (terms: 84, documents: 51)>>
Non-/sparse entries: 2451/1833
Sparsity : 43%
Maximal term length: 10
Weighting : term frequency (tf)
[[2]]
<<TermDocumentMatrix (terms: 82, documents: 52)>>
Non-/sparse entries: 2409/1855
Sparsity : 44%
Maximal term length: 11
Weighting : term frequency (tf)
[[3]]
<<TermDocumentMatrix (terms: 68, documents: 51)>>
Non-/sparse entries: 1926/1542
Sparsity : 44%
Maximal term length: 13
Weighting : term frequency (tf)
[[4]]
<<TermDocumentMatrix (terms: 36, documents: 48)>>
Non-/sparse entries: 985/743
Sparsity : 43%
Maximal term length: 10
Weighting : term frequency (tf)
[[5]]
<<TermDocumentMatrix (terms: 57, documents: 50)>>
Non-/sparse entries: 1475/1375
Sparsity : 48%
Maximal term length: 10
Weighting : term frequency (tf)
[[6]]
<<TermDocumentMatrix (terms: 34, documents: 50)>>
Non-/sparse entries: 896/804
Sparsity : 47%
Maximal term length: 8
Weighting : term frequency (tf)
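I don't know... it's strange, right? If the value of RST is the same, why do the results of removeSparseTerms differ for the last two directories between the two cases? Please help me; not knowing the reason is killing me.
Thank you very much, and have a nice day.
A reproducible example, based on the OP's update: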
library(tm)
library(RWeka)
download.file("https://docs.google.com/uc?authuser=0&id=0B4RCdYlVF8otUll6V2x0cDJORGc&export=download", tf <- tempfile(fileext = ".zip"), mode = "wb")
unzip(tf, exdir = tempdir())
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
generateTDM <- function (Room_name, dest.train, rst){
  s.dir <- sprintf("%s/%s", dest.train, Room_name)
  # Create a corpus from the already cleaned txt files.
  s.cor <- Corpus(DirSource(directory = s.dir, pattern = "txt", encoding = "UTF-8"))
  # Build a term-document matrix from the corpus, considering unigrams, bigrams and trigrams.
  s.tdm <- TermDocumentMatrix(s.cor, control = list(bounds = list(local = c(2, Inf)), tokenize = TrigramTokenizer))
  t <- table(s.tdm$i) > (s.tdm$ncol * (1 - rst)) # from tm::removeSparseTerms()
  termIndex <- as.numeric(names(t[t]))
  return(s.tdm[termIndex, ])
}
dest.train <- file.path(tempdir(), "stackoverflow", "TrainDocs")
Room_name <- "Venus"
p <- 4
RST1 <- seq(0.45, 0.6, 0.05)
RST2 <- seq(0.45, 0.8, 0.05)
RST2[p]
# [1] 0.6
RST1[p]
# [1] 0.6
identical(RST2[p], RST1[p])
# [1] FALSE # ?!?
lapply(Room_name, generateTDM, dest.train, RST1[p])
# <<TermDocumentMatrix (terms: 48, documents: 50)>>
lapply(Room_name, generateTDM, dest.train, RST2[p])
# <<TermDocumentMatrix (terms: 57, documents: 50)>> # ?!?
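The `identical()` result above suggests that the two values print as 0.6 but differ in their last bits, which is a common floating-point effect of building sequences with `seq()`. The following base-R sketch is one way to check this and one possible workaround. It is an assumption-laden illustration: the document count of 50 is taken from the output above, the names `rst1` and `rst2` are introduced here, and rounding to two decimals assumes the thresholds are meant to be multiples of 0.05.

```r
p <- 4
RST1 <- seq(0.45, 0.6, 0.05)
RST2 <- seq(0.45, 0.8, 0.05)

# Print more digits than the default to expose any tiny difference.
print(sprintf("%.20f", c(RST1[p], RST2[p])))

# Tolerant comparison, as opposed to the bit-for-bit identical().
print(isTRUE(all.equal(RST1[p], RST2[p])))

# Thresholds as computed in generateTDM() for 50 documents,
# shown as offsets from the nominal value of 20.
n_docs <- 50
print(n_docs * (1 - c(RST1[p], RST2[p])) - 20)

# Possible workaround: round to the intended precision before use.
rst1 <- round(RST1[p], 2)
rst2 <- round(RST2[p], 2)
print(identical(rst1, rst2))
```

If the two thresholds land on either side of 20, terms that occur in exactly 20 documents would pass the strict `>` test in one call and fail it in the other, which would be consistent with the 48 versus 57 terms reported above. This is an inference from the output, not something checked against the attached data. Passing the rounded value, for example `lapply(Room_name, generateTDM, dest.train, round(RST2[p], 2))`, should make the two calls use the same threshold.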
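IMHO it would be better to highlight the differences and provide sample data for reproduction, rather than stressing "I don't know... I cannot understand" multiple times. :-) – lukeA
Yes, you're right. I will prepare a zip file with the relevant documents and script parts and attach it here as soon as possible. Sorry. –
Done. Sample data attached, and the stress level in the sentences reduced a bit. :) –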