Counting words in individual documents from a corpus in R and putting them in a data frame

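I have a set of text documents, each containing the text of TV series episode scripts. Each document is a different series. I want to compare the most-used words in each series. I was thinking I could plot them with ggplot, with 'Series 1 terms that occur at least x times' on one axis and 'Series 2 terms that occur at least x times' on the other. I expect what I need is a data frame with 3 columns, 'Terms', 'Series x' and 'Series Y', where the series x and y columns hold the number of times each word occurs.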

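I have tried several ways of doing this and failed. The closest I have got is reading in the corpus and creating a data frame with all of the terms in one column, like so: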

library("tm") 

corpus <- Corpus(DirSource("series")) 
corpus.p <- tm_map(corpus, content_transformer(tolower)) # lower-case first so capitalised stopwords match (tm >= 0.6) 
corpus.p <- tm_map(corpus.p, removeWords, stopwords("english")) # removes stopwords 
corpus.p <- tm_map(corpus.p, stripWhitespace) # collapses runs of whitespace 
corpus.p <- tm_map(corpus.p, removeNumbers) 
corpus.p <- tm_map(corpus.p, removePunctuation) 
dtm <- DocumentTermMatrix(corpus.p) 
docTermMatrix <- as.matrix(dtm) # dense matrix: documents in rows, terms in columns 
termCountFrame <- data.frame(Term = colnames(docTermMatrix))

I then know that I can add a column of word counts like this:

termCountFrame$seriesX <- colSums(docTermMatrix)

But that adds up the occurrences from both documents, when I only want the counts from a single one.

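So my questions are:

1) Is it possible to use colSums on a single document? If not, is there another way to turn the DocumentTermMatrix into a data frame with term counts for each document?

2) Does anyone know how to limit this so that I only get the most-used terms in each document?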

2013-06-25 ds10

If your data is in a document-term matrix, you can use tm::findFreqTerms to get the most frequent terms in a document. Here is a reproducible example:

require(tm) 
data(crude) 
dtm <- DocumentTermMatrix(crude) 
dtm 
A document-term matrix (20 documents, 1266 terms) 

Non-/sparse entries: 2255/23065 
Sparsity   : 91% 
Maximal term length: 17 
Weighting   : term frequency (tf) 

# find most frequent terms in all 20 docs 
findFreqTerms(dtm, 2, 100) 

# find the doc names 
dtm$dimnames$Docs 
[1] "127" "144" "191" "194" "211" "236" "237" "242" "246" "248" "273" "349" "352" "353" "368" "489" "502" 
[18] "543" "704" "708" 

# do freq words on one doc 
findFreqTerms(dtm[dtm$dimnames$Docs == "127", ], 2, 100) 
[1] "crude"  "cut"  "diamond" "dlrs"  "for"  "its"  "oil"  "price"  
[9] "prices" "reduction" "said."  "that"  "the"  "today"  "weak"
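
findFreqTerms needs frequency bounds up front and returns the terms without their counts. If you would rather take the top n terms of one document together with their counts, a rough sketch in base R follows. It assumes you have already run as.matrix(dtm); the toy matrix, its counts, and the helper name top_terms are made up here for illustration and are not part of tm.

```r
# Toy stand-in for as.matrix(dtm): rows are documents, columns are terms.
# Document names and counts are invented for illustration.
mat <- matrix(
  c(5, 0, 2, 1, 3,
    1, 4, 0, 6, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs  = c("127", "144"),
                  Terms = c("crude", "opec", "cut", "oil", "price"))
)

# Hypothetical helper: the n highest counts in one document, as a data frame.
top_terms <- function(mat, doc, n = 3) {
  counts <- sort(mat[doc, ], decreasing = TRUE)  # counts for that row only
  counts <- head(counts[counts > 0], n)          # drop absent terms, keep n
  data.frame(Term = names(counts), Count = unname(counts))
}

top_terms(mat, "127")
top_terms(mat, "144", n = 2)
```

Because the terms are ranked rather than filtered by a fixed range, this works even when you do not know in advance how frequent the terms are.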

Here is how to find the most frequent words for every document in the dtm, one document at a time:

# find freq words for each doc, one by one 
list_freqs <- lapply(dtm$dimnames$Docs, 
       function(i) findFreqTerms(dtm[dtm$dimnames$Docs == i, ], 2, 100)) 


list_freqs 
[[1]] 
[1] "crude"  "cut"  "diamond" "dlrs"  "for"  "its"  "oil"  "price"  
[9] "prices" "reduction" "said."  "that"  "the"  "today"  "weak"  

[[2]] 
[1] "\"opec"  "\"the"  "15.8"   "ability"  "above"  "address"  "agreement" 
[8] "analysts"  "and"   "before"  "bpd"   "but"   "buyers"  "current"  
[15] "demand"  "emergency" "energy"  "for"   "has"   "have"   "higher"  
[22] "hold"   "industry"  "its"   "keep"   "market"  "may"   "meet"   
[29] "meeting"  "mizrahi"  "mln"   "must"   "next"   "not"   "now"   
[36] "oil"   "opec"   "organization" "prices"  "problem"  "production" "said"   
[43] "said."  "set"   "that"   "the"   "their"  "they"   "this"   
[50] "through"  "will"   

[[3]] 
[1] "canada" "canadian" "crude" "for"  "oil"  "price" "texaco" "the"  

[[4]] 
[1] "bbl." "crude" "dlrs" "for"  "price" "reduced" "texas" "the"  "west" 

[[5]] 
[1] "and"  "discounted" "estimates" "for"  "mln"  "net"  "pct"  "present" 
[9] "reserves" "revenues" "said"  "study"  "that"  "the"  "trust"  "value"  

[[6]] 
[1] "ability"  "above"   "ali"   "and"   "are"   "barrel."  
[7] "because"  "below"   "bpd"   "bpd."   "but"   "daily"   
[13] "difficulties" "dlrs"   "dollars"  "expected"  "for"   "had"   
[19] "has"   "international" "its"   "kuwait"  "last"   "local"   
[25] "march"   "markets"  "meeting"  "minister"  "mln"   "month"   
[31] "official"  "oil"   "opec"   "opec\"s"  "prices"  "producing"  
[37] "pumping"  "qatar,"  "quota"   "referring"  "said"   "said."   
[43] "sheikh"  "such"   "than"   "that"   "the"   "their"   
[49] "they"   "this"   "was"   "were"   "which"   "will"   

[[7]] 
[1] "\"this"  "and"   "appears"  "are"   "areas"   "bank"   
[7] "bankers"  "been"   "but"   "crossroads" "crucial"  "economic"  
[13] "economy"  "embassy"  "fall"   "for"   "general"  "government" 
[19] "growth"  "has"   "have"   "indonesia\"s" "indonesia," "international" 
[25] "its"   "last"   "measures"  "nearing"  "new"   "oil"   
[31] "over"   "rate"   "reduced"  "report"  "say"   "says"   
[37] "says."   "sector"  "since"   "the"   "u.s."   "was"   
[43] "which"   "with"   "world"   

[[8]] 
[1] "after"  "and"  "deposits" "had"  "oil"  "opec"  "pct"  "quotes"  
[9] "riyal"  "said"  "the"  "were"  "yesterday." 

[[9]] 
[1] "1985/86"  "1986/87"  "1987/88"  "abdul-aziz" "about"  "and"   "been"  
[8] "billion"  "budget"  "deficit"  "expenditure" "fiscal"  "for"   "government" 
[15] "had"   "its"   "last"  "limit"  "oil"   "projected" "public"  
[22] "qatar,"  "revenue"  "riyals"  "riyals."  "said"  "sheikh"  "shortfall" 
[29] "that"  "the"   "was"   "would"  "year"  "year's"  

[[10]] 
[1] "15.8"  "about"  "above"  "accord" "agency" "ali"  "among"  "and"  
[9] "arabia" "are"  "dlrs"  "for"  "free"  "its"  "kuwait" "market" 
[17] "market," "minister," "mln"  "nazer"  "oil"  "opec"  "prices" "producing" 
[25] "quoted" "recent" "said"  "said."  "saudi"  "sheikh" "spa"  "stick"  
[33] "that"  "the"  "they"  "under"  "was"  "which"  "with"  

[[11]] 
[1] "1.2"  "and"  "appeared" "arabia's" "average" "barrel." "because" "below"  
[9] "bpd"  "but"  "corp"  "crude"  "december" "dlrs"  "export"  "exports" 
[17] "february" "fell"  "for"  "four"  "from"  "gulf"  "january" "january," 
[25] "last"  "mln"  "month"  "month,"  "neutral" "official" "oil"  "opec"  
[33] "output"  "prices"  "production" "refinery" "said"  "said."  "saudi"  "sell"  
[41] "sources" "than"  "the"  "they"  "throughput" "week"  "yanbu"  "zone"  

[[12]] 
[1] "and"  "arab"  "crude"  "emirates" "gulf"  "ministers" "official" "oil"  
[9] "states" "the"  "wam"  

[[13]] 
[1] "accord" "agency" "and" "arabia" "its" "nazer" "oil" "opec" "prices" "saudi" "the" 
[12] "under" 

[[14]] 
[1] "crude" "daily" "for"  "its"  "oil"  "opec" "pumping" "that" "the"  "was"  

[[15]] 
[1] "after" "closed" "new"  "nuclear" "oil"  "plant" "port" "power" "said" "ship" 
[11] "the"  "was"  "when" 

[[16]] 
[1] "about"  "and"   "development" "exploration" "for"   "from"  "help"  
[8] "its"   "mln"   "oil"   "one"   "present"  "prices"  "research" 
[15] "reserve"  "said"  "strategic" "the"   "u.s."  "with"  "would"  

[[17]] 
[1] "about"  "and"   "benefits" "development" "exploration" "for"   "from"  
[8] "group"  "help"  "its"   "mln"   "oil"   "one"   "policy"  
[15] "present"  "prices"  "protect"  "research" "reserve"  "said"  "strategic" 
[22] "study"  "such"  "the"   "u.s."  "with"  "would"  

[[18]] 
[1] "1.50" "company" "crude" "dlrs" "for"  "its"  "lowered" "oil"  "posted" "prices" 
[11] "said" "said." "the"  "union" "west" 

[[19]] 
[1] "according" "and"   "april"  "before"  "can"   "change"  "efp"   
[8] "energy"  "entering"  "exchange"  "for"   "futures"  "has"   "hold"   
[15] "increase"  "into"   "mckiernan" "new"   "not"   "nymex"  "oil"   
[22] "one"   "position"  "prices"  "rule"   "said"   "spokeswoman." "that"   
[29] "the"   "traders"  "transaction" "when"   "will"   

[[20]] 
[1] "1986,"  "1987"   "billion"  "cubic"  "fiscales"  "january"  "mln"   
[8] "pct"   "petroliferos" "yacimientos"

If you want this output as a data frame, you can do this:

# from here http://stackoverflow.com/a/7196565/1036500 
L <- list_freqs 
cfun <- function(L) { 
    pad.na <- function(x,len) { 
    c(x,rep(NA,len-length(x))) 
    } 
    maxlen <- max(sapply(L,length)) 
    do.call(data.frame,lapply(L,pad.na,len=maxlen)) 
} 
# make dataframe of words (but probably you want words as rownames and cells with counts?) 
tab_freqa <- cfun(L)
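
If the NA padding is awkward to work with, one alternative (my suggestion, not part of the approach above) is a long data frame with one row per document/term pair. The sketch below uses a small made-up list in place of list_freqs; with the real list you would first name it, for example with names(list_freqs) <- dtm$dimnames$Docs.

```r
# Toy stand-in for list_freqs, named by document (values are invented).
list_freqs <- list("127" = c("crude", "oil", "price"),
                   "144" = c("opec", "oil"))

# stack() turns a named list into two columns: values and ind.
long_freqs <- stack(list_freqs)
names(long_freqs) <- c("Term", "Doc")
long_freqs
```

Each document contributes only as many rows as it has frequent terms, so no padding is needed.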

However, if you want to plot "doc 1 high-frequency terms vs. doc 2 high-frequency terms", then we need a different approach...

# convert dtm to matrix 
mat <- as.matrix(dtm) 

# make data frame similar to "3 columns 'Terms', 
# 'Series x', 'Series Y'. With series x and y 
# having the number of times that word occurs" 
cb <- data.frame(doc1 = mat['127',], doc2 = mat['144',]) 

# keep only words that are in at least one doc 
cb <- cb[rowSums(cb) > 0, ] 

# plot 
require(ggplot2) 
ggplot(cb, aes(doc1, doc2)) + 
    geom_text(label = rownames(cb), 
      position=position_jitter())
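
The question asked for terms that occur "at least x times". A minimal sketch of that filter follows, assuming a data frame shaped like cb above; the toy counts and the threshold x are invented for illustration.

```r
# Toy stand-in for the cb data frame built above (counts are invented).
cb <- data.frame(doc1 = c(5, 0, 2, 1),
                 doc2 = c(1, 4, 0, 1),
                 row.names = c("crude", "opec", "cut", "oil"))

x <- 2  # assumed minimum count; tune to your data

# Keep terms that reach the threshold in at least one of the two documents.
cb_top <- cb[cb$doc1 >= x | cb$doc2 >= x, ]
cb_top
```

Passing cb_top instead of cb to ggplot should thin out the cloud of rare words near the origin.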

Or, perhaps slightly more efficiently, we can make one big data frame of all the documents and plot from that:

# this is the typical method to turn a 
# dtm into a df... 
df <- as.data.frame(as.matrix(dtm)) 
# and transpose for plotting 
df <- data.frame(t(df)) 
# plot 
require(ggplot2) 
ggplot(df, aes(X127, X144)) + 
    geom_text(label = rownames(df), 
      position=position_jitter())

This will look better after removing stopwords, but it is a decent proof of concept. Is that what you were after?

2013-06-28 06:48:09 Ben

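This is really great. I learned a lot from this answer. – ds10

Glad to help! R is remarkably good at handling words for a language designed for numerical work. – Ben

What if I want a data frame of just the most frequent terms in each document? Especially if I don't know the frequency range of the terms... – Bryan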

For question 1), I created the data frame I wanted by transposing with t(docTermMatrix) and then using as.data.frame:

dtm.frame <- as.data.frame(t(docTermMatrix))

2013-06-25 12:26:30 ds10

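The only problem with this is that you get numbers as colnames, which causes trouble for a lot of things, not least plotting with 'ggplot'. – Ben

Ben's answer is the way to go. – ds10

I hope you don't mind, but I have written up the answer here. I give you full credit: http://paddytherabbit.com/comparison-word-usage-in-text-documents-using-r-some-basics/ – ds10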
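
On the numeric column names raised in the comments: one possible workaround (a sketch on a made-up matrix, not something proposed in the thread) is to pass the names through base R's make.names, which prefixes names that start with a digit with "X".

```r
# Toy stand-in for docTermMatrix: documents in rows, terms in columns.
# Document names and counts are invented for illustration.
mat <- matrix(c(2, 0,
                1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(Docs = c("127", "144"),
                              Terms = c("crude", "oil")))

dtm.frame <- as.data.frame(t(mat))
names(dtm.frame)                      # "127" "144": awkward to refer to

# make.names() turns them into syntactic names.
names(dtm.frame) <- make.names(names(dtm.frame))
names(dtm.frame)                      # "X127" "X144"
```

The columns can then be referred to directly, for example as aes(X127, X144) in ggplot.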
