Frequency Per Term - R TM DocumentTermMatrix

I'm very new to R and can't wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It holds the terms and their frequencies, but I can't figure out how to access them.

Ideally, I'd like:

Term # 
    "the" 200 
    "is" 400 
    "a" 200

My current code is:

library(tm)
common.words <- c("amp", "@RT", "I", "http", "https", stopwords("english"), "you")
x <- Corpus(VectorSource(results))
x <- tm_map(x, stripWhitespace)
x <- tm_map(x, removeNumbers)
x <- tm_map(x, removePunctuation)
x <- tm_map(x, stripWhitespace)

dtm <- DocumentTermMatrix(x)
# Drop every column whose term is in common.words (same result as looping over the words)
dtm <- dtm[, !colnames(dtm) %in% common.words]

Here is the output of str(dtm):

List of 6 
    $ i  : int [1:9769] 1 1 1 1 1 1 1 1 2 2 ... 
    $ j  : int [1:9769] 1596 1684 1858 2112 2175 2490 2714 2814 873 961 ... 
    $ v  : num [1:9769] 1 1 2 1 1 2 1 1 1 1 ... 
    $ nrow : int 1477 
    $ ncol : int 3201 
    $ dimnames:List of 2 
    ..$ Docs : chr [1:1477] "1" "2" "3" "4" ... 
    ..$ Terms: chr [1:3201] "\u0093\u0085a" "aardvark" "aaron" "abbie" ... 
    - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix" 
    - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Thanks

-A

Source

2013-01-20 user1994952

You can try inspect(dtm) – agstudy

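It appears to be a sparse-matrix organization of the data. The frequencies look to be in the "v" list, and you can find the term each frequency belongs to by looking up its position in the Terms attribute. Why not provide dput(head(results, 30)) so that your code (and your SO readers) have something to work with? After walking through the examples in the package, I suspect you really want something along the lines of: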

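tdm <- TermDocumentMatrix(x)
# as.matrix() gives the full dense counts; in recent tm versions inspect() only prints a sample.
# The requested terms must actually be present in the matrix.
z <- as.matrix(tdm[c("the", "is", "a"), dimnames(tdm)$Docs])
rowSums(z)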
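
To make the "v" lookup described above concrete, here is a minimal base-R sketch of how the i/j/v triplet fields from the str() output map to per-term totals. The `toy` list and its values are invented for illustration and are not the asker's data; a real DocumentTermMatrix keeps its term names under dimnames rather than in a `terms` field.

```r
# Toy stand-in for the i/j/v triplet fields shown in the str() output.
# All values here are made up for illustration.
toy <- list(
  i = c(1L, 1L, 2L, 2L, 3L),   # document (row) index of each non-zero cell
  j = c(1L, 3L, 1L, 2L, 3L),   # term (column) index of each non-zero cell
  v = c(2, 1, 1, 4, 3),        # count stored in that cell
  terms = c("aardvark", "aaron", "abbie")
)

# Sum the counts per term index, then label the sums with the term names.
term_freq <- tapply(toy$v, factor(toy$j, levels = seq_along(toy$terms)), sum)
names(term_freq) <- toy$terms

# Present the result as the two-column table the question asks for.
freq_table <- data.frame(Term = names(term_freq),
                         Frequency = as.vector(term_freq),
                         row.names = NULL)
print(freq_table)
```

The same grouping idea applies to the real object: every entry of v belongs to the term at position j, so summing v by j yields one total per term.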

Source

2013-01-20 18:21:38

I had the same problem and found what I think is a simpler way:

library(magrittr) # provides %>%

num <- 10 # Show this many terms

# Note: findFreqTerms() returns terms in matrix order, not sorted by frequency
tdm[findFreqTerms(tdm)[1:num],] %>%
  as.matrix() %>%
  rowSums()

Getting it printed as columns is trickier (I'm sure someone has a better way than this):

library(dplyr) # provides %>%, arrange() and desc()

terms <- findFreqTerms(tdm)[1:num]
tdm[terms,] %>%
  as.matrix() %>%
  rowSums() %>%
  data.frame(Term = terms, Frequency = .) %>%
  arrange(desc(Frequency))
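
As a possible alternative that skips the dense as.matrix() conversion, the sketch below sums the sparse matrix directly with slam::col_sums() (slam is the package tm builds its sparse matrices on) and sorts the result so the most frequent terms come first. The `docs` vector is made up and stands in for the asker's `results`; the wordLengths setting is an assumption added here so that short terms such as "a" and "is" are not dropped.

```r
library(tm)  # tm depends on slam, so slam::col_sums() is available

# Made-up documents standing in for the asker's `results` vector.
docs <- c("the cat sat on the mat", "the dog chased the cat", "a dog is a dog")
corpus <- VCorpus(VectorSource(docs))

# wordLengths = c(1, Inf) keeps short terms; the default lower bound
# of 3 characters would remove "a", "is" and "on".
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# col_sums() totals each term column on the sparse representation.
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)

# Two-column Term / Frequency table, most frequent terms first.
freq_table <- data.frame(Term = names(freq),
                         Frequency = as.vector(freq),
                         row.names = NULL)
head(freq_table, 10)
```

For a TermDocumentMatrix such as `tdm` above, slam::row_sums() plays the same role, because terms are stored in rows there.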

Source

2014-12-22 16:46:50 Bob
