Here is one approach, but then again, why not use the data structures from the tm package?
## Your data
## dat <- structure(list(person = structure(1:5, .Label = c("Doc1", "Doc2",
## "Doc3", "Doc4", "Doc5"), class = "factor"),
## text = c("the test was to test the test",
## "we did prepare the exam to test the exam", "was the test the exam",
## "the exam we did prepare was to test the test",
## "we were successful so we all passed the exam"
## )), .Names = c("doc", "text"), class = "data.frame", row.names = c(NA,
## -5L))
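## Function to turn a list of vectors into a table of counts
## (one row per vector, one column per unique value; stored densely, not sparse)
mtabulate <- function(vects) {
    ## Sorted vocabulary shared by all vectors
    lev <- sort(unique(unlist(vects)))
    ## Count each level in every vector and stack the counts row-wise
    dat <- do.call(rbind, lapply(vects, function(x, lev) {
        tabulate(factor(x, levels = lev), nbins = length(lev))
    }, lev = lev))
    colnames(dat) <- lev
    data.frame(dat, check.names = FALSE)
}

## Split each document's text into lower-case words
out <- lapply(split(dat$text, dat$doc), function(x) {
    unlist(strsplit(tolower(x), " "))
})

## Transpose so that terms are rows and documents are columns
t(mtabulate(out))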
##            Doc1 Doc2 Doc3 Doc4 Doc5
## all           0    0    0    0    1
## did           0    1    0    1    0
## exam          0    2    1    1    1
## passed        0    0    0    0    1
## prepare       0    1    0    1    0
## so            0    0    0    0    1
## successful    0    0    0    0    1
## test          3    1    1    2    0
## the           2    2    2    2    1
## to            1    1    0    1    0
## was           1    0    1    1    0
## we            0    1    0    1    2
## were          0    0    0    0    1
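For comparison, the same term-by-document counts can probably be had from base R's `table()` without a helper function. This is only a sketch, not what the tm package does internally; the names `words`, `long` and `tdm` are made up for illustration, and the data frame is rebuilt here so the block runs on its own.

```r
## Same data as above, rebuilt so this block is self-contained
dat <- data.frame(
    doc  = factor(paste0("Doc", 1:5)),
    text = c("the test was to test the test",
             "we did prepare the exam to test the exam",
             "was the test the exam",
             "the exam we did prepare was to test the test",
             "we were successful so we all passed the exam"),
    stringsAsFactors = FALSE
)

## Split each document into lower-case words (one character vector per doc)
words <- strsplit(tolower(dat$text), " ", fixed = TRUE)

## Long format: one row per word occurrence, tagged with its document
## (`long` is a hypothetical intermediate name)
long <- data.frame(
    term = unlist(words),
    doc  = rep(dat$doc, lengths(words))
)

## Cross-tabulate: terms as rows, documents as columns
tdm <- table(long$term, long$doc)
tdm
```

The result is a `table` object rather than a data frame; `as.data.frame.matrix(tdm)` converts it if a data frame is needed. Like the function above, this splits on single spaces only and does no punctuation or stop-word handling.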
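You could look at the source code of the 'tm' package and rewrite it... but why don't you want to use the ready-made tools? – Justin
I would start by looking at the source code of the functions that already exist. – David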