How can I tabulate term frequency data, with or without a document-term matrix?

Input

Big Fat Apple   3 
Small Fat Apple  2 
Little Small Pear  1

Expected output:

Big = 3 
Fat = 3+2=5 
Apple = 3+2=5 
Small = 2+1=3 
Little = 1 
Pear = 1

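I tried to get a document-term matrix to treat this as a corpus, but I could not find a way to make "Big Fat Apple" actually appear in the corpus as "Big Fat Apple Big Fat Apple Big Fat Apple".

Is there a way to do this kind of tabulation? Ideally, I would like the result in the form of a document-term matrix so that I can use other functions on it.

Source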

2015-04-22 Jia He Lim

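Have you tried the 'tm' package? http://cran.r-project.org/web/packages/tm/ – rmuc8

To turn a data frame like this into a corpus, you have to tell it explicitly that each text should be repeated n times, using rep():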

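# Sample data: each text and the number of times it occurs
d <- data.frame(
    text = c("Big Fat Apple",
             "Small Fat Apple",
             "Little Small Pear"),
    n = c(3, 2, 1), stringsAsFactors = FALSE)

library(tm)
# Repeat each text n times so the corpus reflects the counts
corpus <- Corpus(VectorSource(rep(d$text, d$n)))
dtm <- DocumentTermMatrix(corpus)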

You can then compute the term frequencies (see "How to find term frequency within a DTM in R?").
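One common way to get those totals is to sum the columns of the DTM. The following is only a sketch: it assumes the corpus is small enough to convert to a dense matrix, and the name `term_freq` is made up for illustration. Note that tm lowercases terms by default, so the names should come back as `apple`, `big`, and so on.

```r
library(tm)

d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Replicate each text n times, then build the document-term matrix
corpus <- Corpus(VectorSource(rep(d$text, d$n)))
dtm <- DocumentTermMatrix(corpus)

# Column sums of the dense matrix give the total count per term
# (term_freq is an illustrative name, not part of tm)
term_freq <- colSums(as.matrix(dtm))
sort(term_freq, decreasing = TRUE)
```

For a large corpus the dense conversion can get expensive; `slam::col_sums(dtm)` should give the same totals while keeping the matrix sparse.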

Source

2015-04-22 09:17:57 scoa

Using the sample data from @scoa's answer, you could try cSplit from my "splitstackshape" package, like this:

> library(splitstackshape) 
> cSplit(d, "text", " ", "long")[, sum(n), by = text] 
    text V1 
1: Big 3 
2: Fat 5 
3: Apple 5 
4: Small 3 
5: Little 1 
6: Pear 1
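If you would rather not add a package, the same weighted sum can be sketched in base R: split each text into words, repeat each row's count once per word, and sum the counts by word. The names `tokens`, `weights`, and `freq` are just illustrative, and splitting on a single space assumes the texts are cleanly space-separated as in the sample data.

```r
d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Split each text into its words (a list with one character vector per row)
tokens <- strsplit(d$text, " ", fixed = TRUE)

# Repeat each row's count once for every word in that row
weights <- rep(d$n, lengths(tokens))

# Sum the counts per distinct word
freq <- tapply(weights, unlist(tokens), sum)
freq
```

Unlike the tm route, this keeps the original capitalisation, so "Apple" and "apple" would be counted separately unless you call tolower() first.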

Source

2015-04-24 06:45:18 A5C1D2H2I1M1N2O1R2T1

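I would suggest trying the quanteda package (quantitative analysis of textual data). What you want can be approached either by tokenising and tabulating, or by creating a document-feature matrix (here, with a single document):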

cat("Big Fat Apple   3 
    Small Fat Apple  2 
    Little Small Pear  1\n", file = "example.txt") 
mydata <- read.table("example.txt", stringsAsFactors = FALSE) 
mydata <- paste(with(mydata, rep(paste(V1, V2, V3), V4)), collapse = " ") 
mydata 
## [1] "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear" 

# use the quanteda package as an alternative to tm 
install.packages("quanteda") 
library(quanteda) 
# can simply tokenize and tabulate 
table(tokenize(mydata)) 
## apple big fat little pear small 
##  5  3  5  1  1  3 

# alternatively, can create a one-document document-term matrix 
myDfm <- dfm(mydata) 
## Creating a dfm from a character vector ... 
## ... indexing 1 document 
## ... tokenizing texts, found 18 total tokens 
## ... cleaning the tokens, 0 removed entirely 
## ... summing tokens by document 
## ... indexing 6 feature types 
## ... building sparse matrix 
## ... created a 1 x 6 sparse dfm 
## ... complete. Elapsed time: 0.011 seconds. 
myDfm 
## Document-feature matrix of: 1 document, 6 features. 
## 1 x 6 sparse Matrix of class "dfmSparse" 
## features 
## docs apple big fat little pear small 
## text1  5 3 5  1 1  3
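The calls above reflect the quanteda API at the time this answer was written. In later releases, as far as I know, tokenize() was replaced by tokens() and dfm() is built from a tokens object rather than directly from a character vector. The following is a sketch under that assumption, so check the documentation for your installed version; `toks` and `feat_freq` are illustrative names.

```r
library(quanteda)

# Rebuild the replicated text without going through a file
mydata <- paste(
  rep(c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
      times = c(3, 2, 1)),
  collapse = " "
)

# Tokenize first, then build the one-document document-feature matrix
toks <- tokens(mydata)
myDfm <- dfm(toks)  # features are lowercased by default

# Total count per feature
feat_freq <- colSums(myDfm)
feat_freq
```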

Happy to help with any questions you may have about quanteda, as we are actively looking to improve it.

Source

2015-05-31 16:26:18
