2013-06-25 37 views
5

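Count words in a single document from a corpus in R and put them in a data frame

I have been given a set of text documents, each containing the scripts of a TV series. Each document is a different series. I want to compare the most-used words in each series. I figured I could plot them with ggplot, with 'Series 1 terms that occur at least x times' on one axis and 'Series 2 terms that occur at least x times' on the other. I expect that what I need is a data frame with 3 columns, 'Terms', 'Series x' and 'Series Y', where the series x and y columns hold the number of times each word occurs.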

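I have tried several ways of doing this and failed. The closest I have got is reading in the corpus and creating a data frame with all the terms in one column, like this: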

library("tm") 

corpus <-Corpus(DirSource("series")) 
corpus.p <-tm_map(corpus, removeWords, stopwords("english")) # remove stop words 
corpus.p <-tm_map(corpus.p, stripWhitespace) # collapse repeated whitespace 
corpus.p <-tm_map(corpus.p, tolower) # convert to lower case 
corpus.p <-tm_map(corpus.p, removeNumbers) # remove digits 
corpus.p <-tm_map(corpus.p, removePunctuation) # remove punctuation 
dtm <-DocumentTermMatrix(corpus.p) 
docTermMatrix <- inspect(dtm) 
termCountFrame <- data.frame(Term = colnames(docTermMatrix)) 

I know I can then add a column that sums the word counts, like this:

termCountFrame$seriesX <- colSums(docTermMatrix) 

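However, this adds up the occurrences from both documents, when I only want the counts for one.

So my questions are:

1) Is it possible to use colSums on a single document? If not, is there another way to turn the DocumentTermMatrix into a data frame with the term counts for each document?

2) Does anyone know how to limit this so that I get only the most-used terms in each document?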

Answers

10

If your data is in a document-term matrix, you can use tm::findFreqTerms to get the most frequent terms in a document. Here is a reproducible example:

require(tm) 
data(crude) 
dtm <- DocumentTermMatrix(crude) 
dtm 
A document-term matrix (20 documents, 1266 terms) 

Non-/sparse entries: 2255/23065 
Sparsity   : 91% 
Maximal term length: 17 
Weighting   : term frequency (tf) 

# find most frequent terms in all 20 docs 
findFreqTerms(dtm, 2, 100) 

# find the doc names 
dtm$dimnames$Docs 
[1] "127" "144" "191" "194" "211" "236" "237" "242" "246" "248" "273" "349" "352" "353" "368" "489" "502" 
[18] "543" "704" "708" 

# do freq words on one doc 
findFreqTerms(dtm[dtm$dimnames$Docs == "127", ], 2, 100) 
[1] "crude"  "cut"  "diamond" "dlrs"  "for"  "its"  "oil"  "price"  
[9] "prices" "reduction" "said."  "that"  "the"  "today"  "weak" 

Here is how to find the most frequent words for every document in the dtm, one document at a time:

# find freq words for each doc, one by one 
list_freqs <- lapply(dtm$dimnames$Docs, 
       function(i) findFreqTerms(dtm[dtm$dimnames$Docs == i, ], 2, 100)) 


list_freqs 
[[1]] 
[1] "crude"  "cut"  "diamond" "dlrs"  "for"  "its"  "oil"  "price"  
[9] "prices" "reduction" "said."  "that"  "the"  "today"  "weak"  

[[2]] 
[1] "\"opec"  "\"the"  "15.8"   "ability"  "above"  "address"  "agreement" 
[8] "analysts"  "and"   "before"  "bpd"   "but"   "buyers"  "current"  
[15] "demand"  "emergency" "energy"  "for"   "has"   "have"   "higher"  
[22] "hold"   "industry"  "its"   "keep"   "market"  "may"   "meet"   
[29] "meeting"  "mizrahi"  "mln"   "must"   "next"   "not"   "now"   
[36] "oil"   "opec"   "organization" "prices"  "problem"  "production" "said"   
[43] "said."  "set"   "that"   "the"   "their"  "they"   "this"   
[50] "through"  "will"   

[[3]] 
[1] "canada" "canadian" "crude" "for"  "oil"  "price" "texaco" "the"  

[[4]] 
[1] "bbl." "crude" "dlrs" "for"  "price" "reduced" "texas" "the"  "west" 

[[5]] 
[1] "and"  "discounted" "estimates" "for"  "mln"  "net"  "pct"  "present" 
[9] "reserves" "revenues" "said"  "study"  "that"  "the"  "trust"  "value"  

[[6]] 
[1] "ability"  "above"   "ali"   "and"   "are"   "barrel."  
[7] "because"  "below"   "bpd"   "bpd."   "but"   "daily"   
[13] "difficulties" "dlrs"   "dollars"  "expected"  "for"   "had"   
[19] "has"   "international" "its"   "kuwait"  "last"   "local"   
[25] "march"   "markets"  "meeting"  "minister"  "mln"   "month"   
[31] "official"  "oil"   "opec"   "opec\"s"  "prices"  "producing"  
[37] "pumping"  "qatar,"  "quota"   "referring"  "said"   "said."   
[43] "sheikh"  "such"   "than"   "that"   "the"   "their"   
[49] "they"   "this"   "was"   "were"   "which"   "will"   

[[7]] 
[1] "\"this"  "and"   "appears"  "are"   "areas"   "bank"   
[7] "bankers"  "been"   "but"   "crossroads" "crucial"  "economic"  
[13] "economy"  "embassy"  "fall"   "for"   "general"  "government" 
[19] "growth"  "has"   "have"   "indonesia\"s" "indonesia," "international" 
[25] "its"   "last"   "measures"  "nearing"  "new"   "oil"   
[31] "over"   "rate"   "reduced"  "report"  "say"   "says"   
[37] "says."   "sector"  "since"   "the"   "u.s."   "was"   
[43] "which"   "with"   "world"   

[[8]] 
[1] "after"  "and"  "deposits" "had"  "oil"  "opec"  "pct"  "quotes"  
[9] "riyal"  "said"  "the"  "were"  "yesterday." 

[[9]] 
[1] "1985/86"  "1986/87"  "1987/88"  "abdul-aziz" "about"  "and"   "been"  
[8] "billion"  "budget"  "deficit"  "expenditure" "fiscal"  "for"   "government" 
[15] "had"   "its"   "last"  "limit"  "oil"   "projected" "public"  
[22] "qatar,"  "revenue"  "riyals"  "riyals."  "said"  "sheikh"  "shortfall" 
[29] "that"  "the"   "was"   "would"  "year"  "year's"  

[[10]] 
[1] "15.8"  "about"  "above"  "accord" "agency" "ali"  "among"  "and"  
[9] "arabia" "are"  "dlrs"  "for"  "free"  "its"  "kuwait" "market" 
[17] "market," "minister," "mln"  "nazer"  "oil"  "opec"  "prices" "producing" 
[25] "quoted" "recent" "said"  "said."  "saudi"  "sheikh" "spa"  "stick"  
[33] "that"  "the"  "they"  "under"  "was"  "which"  "with"  

[[11]] 
[1] "1.2"  "and"  "appeared" "arabia's" "average" "barrel." "because" "below"  
[9] "bpd"  "but"  "corp"  "crude"  "december" "dlrs"  "export"  "exports" 
[17] "february" "fell"  "for"  "four"  "from"  "gulf"  "january" "january," 
[25] "last"  "mln"  "month"  "month,"  "neutral" "official" "oil"  "opec"  
[33] "output"  "prices"  "production" "refinery" "said"  "said."  "saudi"  "sell"  
[41] "sources" "than"  "the"  "they"  "throughput" "week"  "yanbu"  "zone"  

[[12]] 
[1] "and"  "arab"  "crude"  "emirates" "gulf"  "ministers" "official" "oil"  
[9] "states" "the"  "wam"  

[[13]] 
[1] "accord" "agency" "and" "arabia" "its" "nazer" "oil" "opec" "prices" "saudi" "the" 
[12] "under" 

[[14]] 
[1] "crude" "daily" "for"  "its"  "oil"  "opec" "pumping" "that" "the"  "was"  

[[15]] 
[1] "after" "closed" "new"  "nuclear" "oil"  "plant" "port" "power" "said" "ship" 
[11] "the"  "was"  "when" 

[[16]] 
[1] "about"  "and"   "development" "exploration" "for"   "from"  "help"  
[8] "its"   "mln"   "oil"   "one"   "present"  "prices"  "research" 
[15] "reserve"  "said"  "strategic" "the"   "u.s."  "with"  "would"  

[[17]] 
[1] "about"  "and"   "benefits" "development" "exploration" "for"   "from"  
[8] "group"  "help"  "its"   "mln"   "oil"   "one"   "policy"  
[15] "present"  "prices"  "protect"  "research" "reserve"  "said"  "strategic" 
[22] "study"  "such"  "the"   "u.s."  "with"  "would"  

[[18]] 
[1] "1.50" "company" "crude" "dlrs" "for"  "its"  "lowered" "oil"  "posted" "prices" 
[11] "said" "said." "the"  "union" "west" 

[[19]] 
[1] "according" "and"   "april"  "before"  "can"   "change"  "efp"   
[8] "energy"  "entering"  "exchange"  "for"   "futures"  "has"   "hold"   
[15] "increase"  "into"   "mckiernan" "new"   "not"   "nymex"  "oil"   
[22] "one"   "position"  "prices"  "rule"   "said"   "spokeswoman." "that"   
[29] "the"   "traders"  "transaction" "when"   "will"   

[[20]] 
[1] "1986,"  "1987"   "billion"  "cubic"  "fiscales"  "january"  "mln"   
[8] "pct"   "petroliferos" "yacimientos" 
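
`findFreqTerms` needs a lower and an upper frequency bound. If those bounds are not known in advance, one alternative is to rank each document's counts and keep the top few. The following is a minimal sketch in base R, not part of the original answer: it uses a small made-up count matrix in place of `as.matrix(dtm)`, and `top_terms` is a hypothetical helper name rather than a `tm` function.

```r
# Toy document-term count matrix (rows = documents, columns = terms).
# With tm, as.matrix(dtm) gives a matrix of this shape.
mat <- matrix(
  c(5, 0, 2, 1,
    1, 4, 0, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs  = c("series1", "series2"),
                  Terms = c("oil", "price", "crude", "opec"))
)

# Hypothetical helper (not from tm): return the n highest counts of one
# document as a named vector, ignoring terms that do not occur at all.
top_terms <- function(counts, n = 2) {
  counts <- counts[counts > 0]
  head(sort(counts, decreasing = TRUE), n)
}

# One named vector of counts per document
top_by_doc <- lapply(rownames(mat), function(d) top_terms(mat[d, ], n = 2))
names(top_by_doc) <- rownames(mat)
top_by_doc
```

With real data, swapping the toy `mat` for `as.matrix(dtm)` should produce the same kind of list, with one element per document.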

If you want this output as a data frame, you can do this:

# from here http://stackoverflow.com/a/7196565/1036500 
L <- list_freqs 
cfun <- function(L) { 
    pad.na <- function(x,len) { 
    c(x,rep(NA,len-length(x))) 
    } 
    maxlen <- max(sapply(L,length)) 
    do.call(data.frame,lapply(L,pad.na,len=maxlen)) 
} 
# make dataframe of words (but probably you want words as rownames and cells with counts?) 
tab_freqa <- cfun(L) 
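
If the counts are wanted rather than just the words, a long table with one row per document/term pair may be easier to work with than the padded data frame above. This is a sketch in base R, assuming the counts are available as an ordinary matrix (for example from `as.matrix(dtm)`); the toy matrix and the column name `Count` are made up for illustration.

```r
# Toy count matrix standing in for as.matrix(dtm)
mat <- matrix(
  c(5, 0, 2,
    1, 4, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs  = c("series1", "series2"),
                  Terms = c("oil", "price", "crude"))
)

# as.table() + as.data.frame() unrolls the matrix into one row per cell;
# the column names come from the names of the dimnames (Docs, Terms).
long <- as.data.frame(as.table(mat),
                      responseName = "Count",
                      stringsAsFactors = FALSE)

# Drop document/term pairs where the term never occurs
long <- long[long$Count > 0, ]
long
```

A table in this shape can be filtered or sorted per document, and it is the layout that `ggplot2` generally expects when documents are mapped to an aesthetic.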

But if you want to plot "doc 1 high-frequency terms vs doc 2 high-frequency terms", then we need a different approach...

# convert dtm to matrix 
mat <- as.matrix(dtm) 

# make data frame similar to "3 columns 'Terms', 
# 'Series x', 'Series Y'. With series x and y 
# having the number of times that word occurs" 
cb <- data.frame(doc1 = mat['127',], doc2 = mat['144',]) 

# keep only words that are in at least one doc 
cb <- cb[rowSums(cb) > 0, ] 

# plot 
require(ggplot2) 
ggplot(cb, aes(doc1, doc2)) + 
    geom_text(label = rownames(cb), 
      position=position_jitter()) 

Or, perhaps slightly more efficiently, we can make one big data frame of all the documents and plot from that:

# this is the typical method to turn a 
# dtm into a df... 
df <- as.data.frame(as.matrix(dtm)) 
# and transpose for plotting 
df <- data.frame(t(df)) 
# plot 
require(ggplot2) 
ggplot(df, aes(X127, X144)) + 
    geom_text(label = rownames(df), 
      position=position_jitter()) 

It will look better once the stop words are removed, but this is a good proof of concept. Is that what you're after?
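
As a rough illustration of the stop-word point, the corpus can be cleaned before the document-term matrix is built. This sketch assumes a version of `tm` that provides `content_transformer` (needed to wrap base functions such as `tolower`); the names `clean`, `dtm_clean`, `mat_clean` and `cb_clean` are only illustrative.

```r
library(tm)
data(crude)  # the 20 Reuters articles shipped with tm

# Lowercase first so the stop-word list also matches capitalised words
clean <- tm_map(crude, content_transformer(tolower))
clean <- tm_map(clean, removePunctuation)
clean <- tm_map(clean, removeNumbers)
clean <- tm_map(clean, removeWords, stopwords("english"))
clean <- tm_map(clean, stripWhitespace)

# Rebuild the document-term matrix from the cleaned corpus
dtm_clean <- DocumentTermMatrix(clean)
mat_clean <- as.matrix(dtm_clean)

# Same two-document comparison as before, now without stop words
cb_clean <- data.frame(doc1 = mat_clean["127", ], doc2 = mat_clean["144", ])
cb_clean <- cb_clean[rowSums(cb_clean) > 0, ]
head(cb_clean)
```

Passing `cb_clean` to the same `ggplot` call should give a less cluttered plot, since words such as "the" and "for" no longer appear.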

+0

This is brilliant. I learned a lot from this answer. – ds10

+1

Glad to help! `R` is surprisingly good at working with words for a language designed for numerical work. – Ben

+0

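What if I only want a data frame of the most frequent terms in each document? How would I do that, especially if I don't know the frequency range of the terms? – Bryan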

0

For question 1), I created the data frame I wanted by transposing with t(docTermMatrix) and then calling as.data.frame:

dtm.frame <- as.data.frame(t(docTermMatrix)) 

+1

The only problem with this is that you get numbers as colnames, which causes trouble for a lot of things, plotting with `ggplot` at least. – Ben
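
One way around the column-name issue is sketched here with a made-up two-document matrix standing in for `docTermMatrix`: `make.names` turns the numeric document IDs into syntactic names, which are easier to refer to inside `aes()`.

```r
# Toy stand-in for docTermMatrix (rows = documents, columns = terms)
mat <- matrix(c(5, 1, 0, 4), nrow = 2,
              dimnames = list(Docs = c("127", "144"),
                              Terms = c("oil", "price")))

# Transposing puts terms in rows and documents in columns,
# but the column names are still the bare document IDs
dtm.frame <- as.data.frame(t(mat))
colnames(dtm.frame)

# make.names() prefixes them so they become valid R names (X127, X144)
colnames(dtm.frame) <- make.names(colnames(dtm.frame))
colnames(dtm.frame)
```

Wrapping the transposed matrix in `data.frame()` instead, as the other answer does, applies the same renaming by default.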

+0

Ben's answer is the way to go. – ds10

+0

I hope you don't mind, but I have written this answer up here, giving you full credit: http://paddytherabbit.com/comparison-word-usage-in-text-documents-using-r-some-basics/ – ds10
