2016-03-13 30 views
-1

Finding a co-occurrence matrix using R

I have a text file with the content shown below.

Other methods of contraception were discussed, in the framework of 
a chart which showed both the _expected_ failure rate (theoretical, 
assumes no mistakes) and the _actual_ failure rate (based on research). 
Top of the chart was something like this: 


Method     Expected   Actual 
------     Failure Rate Failure Rate 
Abstinence     0%    0% 


And NFP (Natural Family Planning) was on the bottom. The teacher even 
said, "I've had some students tell me that they can't use anything for 
birth control because they're Catholic. Well, if you're not married and 
you're a practicing Catholic, the *top* of the list is your slot, not 
the *bottom*. Even if you're not religious, the top of the list is 
safest." 

From this text file, I need to build a term-term co-occurrence matrix like this:

Correct format required 
      a b c 
    a 0 2 1 
    b 1 0 2 
    c 2 1 0 

What I have done so far is build a sentence-word matrix like this:

sentence_id words 
    1  a  b  c  d  e 
    2  b  c  f  g  h 
    3  j  k  a  b  c 
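
For readers following along, a sentence-word table of this shape could be built in base R roughly as follows. This is only a sketch: the example text is made up, the names `sentences`, `cleaned`, `words` and `sentence_words` are hypothetical, and splitting on whitespace after `.`, `!` or `?` is a naive assumption about where sentences end.

```r
# Made-up example text (hypothetical, not the file from the question)
text <- "The chart showed two rates. One was theoretical! Was the other based on research?"

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# Lower-case, drop punctuation, then split each sentence into words
cleaned <- gsub("[[:punct:]]", "", tolower(sentences))
words <- strsplit(cleaned, "\\s+")

# One row per sentence, with its words joined back into a single string
sentence_words <- data.frame(
  sentence_id = seq_along(words),
  words = vapply(words, paste, character(1), collapse = " "),
  stringsAsFactors = FALSE
)
```

Real text usually needs a proper sentence tokenizer, since abbreviations and quotations break this simple rule.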

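This is the same problem as the one raised in the question "build word co-occurence edge list in R", but the output format of that answer is different from the one I need.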

d <- read.table(text='sentence_id text 
    1   "a b c d e" 
    2   "a b b e" 
    3   "b c d" 
    4   "a e"', header=TRUE, as.is=TRUE) 

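# For each sentence, form every unordered pair of distinct words,
# then tabulate how often each pair occurs across all sentences
result.vec <- table(unlist(lapply(d$text, function(text) {
  pairs <- combn(unique(scan(text=text, what='', sep=' ')), m=2)
  interaction(pairs[1,], pairs[2,])
})))

# Split the "x.y" pair labels back into two columns and drop unseen pairs
result <- subset(data.frame(do.call(rbind, strsplit(names(result.vec),
  '\\.')), freq=as.vector(result.vec)), freq > 0)
with(result, result[order(X1, X2),])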

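This is the code I am using at the moment, but it does not produce the co-occurrence matrix in the required format. Instead, it produces the format below.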

wrong format 
# X1 X2 freq 
# 1 a b 2 
# 5 a c 1 
# 9 a d 1 
# 13 a e 3 
# 6 b c 2 
# 10 b d 2 
# 14 b e 2 
# 11 c d 2 
# 15 c e 1 
# 16 d e 1 
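
An edge list of this shape can be turned into a square matrix by indexing an empty matrix with the two term columns. The following is a minimal sketch, assuming the data frame is called `result` with columns `X1`, `X2` and `freq` as in the output above; the names `terms` and `cooc` are hypothetical, and mirroring the counts so that the matrix is symmetric is an assumption about what the required format means.

```r
# Edge list with the values copied from the output shown above
result <- data.frame(
  X1   = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
  X2   = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
  freq = c(2, 1, 1, 3, 2, 2, 2, 2, 1, 1),
  stringsAsFactors = FALSE
)

# All distinct terms, used for both rows and columns
terms <- sort(unique(c(result$X1, result$X2)))

# Start from an all-zero square matrix labelled by term
cooc <- matrix(0, nrow = length(terms), ncol = length(terms),
               dimnames = list(terms, terms))

# Fill in each (X1, X2) cell by name using a two-column character index
cooc[cbind(result$X1, result$X2)] <- result$freq

# Mirror the upper triangle (assumption: co-occurrence is undirected);
# the diagonal stays 0 because no term is paired with itself
cooc <- cooc + t(cooc)

print(cooc)
```

If the counts are meant to be directed, the mirroring step can simply be left out.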
+0

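You are expected to show which tools you have chosen by including the library calls, and to show what coding effort you have made (including the name of the text object, so that we know what the input to the code is). Otherwise this is just another "please do my homework for me" question and will be closed. You therefore need to edit your question to bring it up to SO standards.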

+0

You also need to do some searching on SO and Google. An SO search for "[r] co-occurrence" gave me more than 60 hits.

+0

Edited. Can you help me find a solution? I did not add anything at first because I wanted to know whether there was a better solution.

Answers

1

I solved it by going through a term-document matrix, from which I obtained the term-term co-occurrence matrix.

library(Matrix)
library(Rcpp)
#library(wordspace)
library(NLP)
library(tm)
library(qdap)
library(reshape2)
library(MASS)
#library(stringr)
#library(gtools)
#library(SnowballC)

#install.packages("reshape2")

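# Directory holding the documents. As posted, this looks for a "Doc50"
# folder inside the installed tm package; point it at your own corpus.
txt <- system.file("Doc50", "", package = "tm")

# Read every file in that directory as one document
(ovid <- VCorpus(DirSource(txt),
                 readerControl = list(language = "en")))

# Basic cleaning: stop words, punctuation, extra whitespace, digits
ovid <- tm_map(ovid, removeWords, stopwords("english"))
ovid <- tm_map(ovid, removePunctuation)
ovid <- tm_map(ovid, stripWhitespace)
ovid <- tm_map(ovid, removeNumbers)

# Terms in rows, documents in columns
termDocMatrix <- TermDocumentMatrix(ovid)

termDocMatrix <- as.matrix(termDocMatrix)

# Keep only the terms that appear in qdap's GradyAugmented word list
colnamesmdsm <- rownames(termDocMatrix)
intersected <- intersect(colnamesmdsm, qdapDictionaries::GradyAugmented)

termDocMatrix <- termDocMatrix[intersected, ]

# Reduce counts to presence/absence (0 or 1)
termDocMatrix[termDocMatrix >= 1] <- 1
# transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)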
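
To see why the final multiplication yields co-occurrence counts, here is a small base R sketch of the same idea that does not need tm. It uses the four example sentences from the question as the "documents"; the names `sentences`, `tokens`, `terms`, `incidence` and `cooc` are hypothetical, and zeroing the diagonal is an assumption made to match the format the question asks for (the product itself puts each term's document count on the diagonal).

```r
# The four example sentences from the question, one "document" each
sentences <- c("a b c d e", "a b b e", "b c d", "a e")

# Distinct words per sentence, and the overall vocabulary
tokens <- lapply(strsplit(sentences, " ", fixed = TRUE), unique)
terms <- sort(unique(unlist(tokens)))

# Binary term-by-sentence incidence matrix: 1 if the term occurs there
incidence <- sapply(tokens, function(tok) as.integer(terms %in% tok))
rownames(incidence) <- terms

# Entry [i, j] counts the sentences that contain both term i and term j
cooc <- incidence %*% t(incidence)

# Assumption: the asker wants 0 on the diagonal, as in the required format
diag(cooc) <- 0

print(cooc)
```

Applied to the answer's own objects, the equivalent last step would be `diag(termMatrix) <- 0`.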