2015-04-22

How can I tabulate term-frequency data, with or without a document-term matrix?

I want to tabulate data like the following.

Input:

Big Fat Apple   3 
Small Fat Apple  2 
Little Small Pear  1 

Expected output:

Big = 3 
Fat = 3+2=5 
Apple = 3+2=5 
Small = 2+1=3 
Little = 1 
Pear = 1 

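I tried to get DocumentTermMatrix to treat this as a corpus, but I could not find a way to make it understand that "Big Fat Apple" actually occurs in the corpus three times, as "Big Fat Apple Big Fat Apple Big Fat Apple".

Is there a way to do this kind of tabulation? Ideally, I would like the result in the form of a document-term matrix, so that I can use other functions on it.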


Have you tried the 'tm' package? http://cran.r-project.org/web/packages/tm/ – rmuc8

Answers

1

To turn a data frame like this into a corpus, you have to tell it explicitly that each text should be repeated n times, using rep():

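# Sample data: each text together with the number of times it occurs
d <- data.frame(
    text = c("Big Fat Apple",
             "Small Fat Apple",
             "Little Small Pear"),
    n = c(3, 2, 1),
    stringsAsFactors = FALSE)

library(tm)
# Repeat each text n times so that the corpus reflects the counts
corpus <- Corpus(VectorSource(rep(d$text, d$n)))
dtm <- DocumentTermMatrix(corpus)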

You can then compute the term frequencies (see "How to find term frequency within a DTM in R?").
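
To make the last step concrete: once each text has been repeated n times, the term frequencies are the column sums of the document-term matrix. Below is a minimal sketch in base R only. It assumes that a plain table() of document index against term can stand in for tm's DocumentTermMatrix and that splitting on whitespace is an adequate tokenizer; the names docs, tokens, doc_id, dtm_base and term_freq are made up for illustration.

```r
# Sample data from the answer above.
d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Repeat each text n times, as in the tm example.
docs <- rep(d$text, d$n)

# Assumption: whitespace splitting is good enough for this data.
tokens <- strsplit(docs, "\\s+")

# Cross-tabulate document index against term to get a dense
# document-term matrix (one row per repeated document).
doc_id <- rep(seq_along(tokens), lengths(tokens))
dtm_base <- table(doc = doc_id, term = unlist(tokens))

# Term frequencies are the column sums of that matrix.
term_freq <- colSums(dtm_base)
term_freq[c("Big", "Fat", "Apple", "Small", "Little", "Pear")]
```

With the tm object itself, colSums(as.matrix(dtm)) should give the same totals, though tm's default settings may lowercase the terms.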

1

Using the sample data from @SCOA's answer, you could try cSplit from my "splitstackshape" package, like this:

> library(splitstackshape) 
> cSplit(d, "text", " ", "long")[, sum(n), by = text] 
    text V1 
1: Big 3 
2: Fat 5 
3: Apple 5 
4: Small 3 
5: Little 1 
6: Pear 1 
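
The same "split to long form, then sum n by term" idea can also be sketched without any package. This is only a rough base R equivalent under the assumption that the words are separated by single spaces; it is not how cSplit works internally, and the names tokens, long and freq are made up for illustration.

```r
# Sample data from the earlier answer.
d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Split each text into words (assumes single-space separators).
tokens <- strsplit(d$text, " ", fixed = TRUE)

# Long form: one row per word, carrying the count of its source text.
long <- data.frame(
  term = unlist(tokens),
  n = rep(d$n, lengths(tokens)),
  stringsAsFactors = FALSE
)

# Sum the counts per term; no need to repeat the texts themselves.
freq <- tapply(long$n, long$term, sum)

# Show the terms in order of first appearance.
freq[unique(long$term)]
```

This avoids materialising the repeated texts, which may matter when the counts are large, but it yields only the totals and not a document-term matrix.
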

1

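I would suggest the quanteda package (Quantitative Analysis of Textual Data). You can get what you want either by tokenizing and tabulating, or by creating a document-feature matrix (here, with a single document):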

cat("Big Fat Apple   3 
    Small Fat Apple  2 
    Little Small Pear  1\n", file = "example.txt") 
mydata <- read.table("example.txt", stringsAsFactors = FALSE) 
mydata <- paste(with(mydata, rep(paste(V1, V2, V3), V4)), collapse = " ") 
mydata 
## [1] "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear" 

# use the quanteda package as an alternative to tm 
install.packages("quanteda") 
library(quanteda) 
# can simply tokenize and tabulate 
table(tokenize(mydata)) 
## apple big fat little pear small 
##  5  3  5  1  1  3 

# alternatively, can create a one-document document-term matrix 
myDfm <- dfm(mydata) 
## Creating a dfm from a character vector ... 
## ... indexing 1 document 
## ... tokenizing texts, found 18 total tokens 
## ... cleaning the tokens, 0 removed entirely 
## ... summing tokens by document 
## ... indexing 6 feature types 
## ... building sparse matrix 
## ... created a 1 x 6 sparse dfm 
## ... complete. Elapsed time: 0.011 seconds. 
myDfm 
## Document-feature matrix of: 1 document, 6 features. 
## 1 x 6 sparse Matrix of class "dfmSparse" 
## features 
## docs apple big fat little pear small 
## text1  5 3 5  1 1  3 

Happy to help with any questions you may have about quanteda, as we are actively looking to improve it.