I suggest the quanteda package (quantitative analysis of textual data). You can approach this in either of two ways: by tokenising and tabulating, or by creating a document-feature matrix (here, with a single document):
cat("Big Fat Apple 3
Small Fat Apple 2
Little Small Pear 1\n", file = "example.txt")
mydata <- read.table("example.txt", stringsAsFactors = FALSE)
mydata <- paste(with(mydata, rep(paste(V1, V2, V3), V4)), collapse = " ")
mydata
## [1] "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear"
# use the quanteda package as an alternative to tm
install.packages("quanteda")
library(quanteda)
# can simply tokenize and tabulate
table(tokenize(mydata))
## apple big fat little pear small
## 5 3 5 1 1 3
# alternatively, can create a one-document document-term matrix
myDfm <- dfm(mydata)
## Creating a dfm from a character vector ...
## ... indexing 1 document
## ... tokenizing texts, found 18 total tokens
## ... cleaning the tokens, 0 removed entirely
## ... summing tokens by document
## ... indexing 6 feature types
## ... building sparse matrix
## ... created a 1 x 6 sparse dfm
## ... complete. Elapsed time: 0.011 seconds.
myDfm
## Document-feature matrix of: 1 document, 6 features.
## 1 x 6 sparse Matrix of class "dfmSparse"
## features
## docs apple big fat little pear small
## text1 5 3 5 1 1 3
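Note that the quanteda API has changed in more recent releases: `tokenize()` was replaced by `tokens()`, and `dfm()` now expects a tokens object rather than a raw character vector. A minimal sketch of the same workflow under the newer (v3+) API — the variable `mydata` is rebuilt inline here so the snippet stands alone:

```r
library(quanteda)  # assumes quanteda >= 3.0

# same single-document text as constructed above
mydata <- "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear"

toks <- tokens(mydata)   # tokens() replaces the deprecated tokenize()
myDfm <- dfm(toks)       # dfm() now takes a tokens object
myDfm                    # 1 document x 6 features, same counts as above
```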
Happy to help with any questions you may have about quanteda, as we are actively seeking to improve it.
Have you tried the 'tm' package? http://cran.r-project.org/web/packages/tm/ – rmuc8