2014-02-22 57 views
1

My question is something of a prequel to "Visualise distances between texts": comparing sentences or paragraphs in a table

I have a table with two sentences to compare for each observation.

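# strip.white trims the padding around "|"; stringsAsFactors keeps the text as character
compare <- read.table(header = TRUE, sep = "|", strip.white = TRUE,
                      stringsAsFactors = FALSE, text =
"person | text1 | text2
person1 | the quick brown fox jumps over the lazy dog | the quick cat jumps on the fast fog
person2 | I dont want to work today | I feel like working today
"
)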

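I would like a column whose values represent the difference between the two sentences for each observation. Basically, I am looking for a function similar to agrep, but for comparing sentences or paragraphs.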

+0

What is the expected output for this example? –

+0

Does this help? http://stackoverflow.com/questions/3182091/fast-levenshtein-distance-in-r – Superbest

Answers

0

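You can use the adist function, which computes the generalized Levenshtein (edit) distance between strings. mapply lets you apply it to every row: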

mapply(adist, compare$text1, compare$text2, USE.NAMES = FALSE)
# [1] 17 15 
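
The raw edit distance grows with the length of the sentences, so long paragraphs can look "more different" than short ones. One possible refinement, shown here only as a sketch, is to divide the distance by the length of the longer string so the result falls between 0 and 1. The helper name norm_adist and the diff column are made up for this illustration; only adist, mapply, nchar, pmax and trimws are base R.

```r
# Hypothetical helper: edit distance divided by the length of the longer string.
# 0 means identical strings, 1 means every character had to change.
norm_adist <- function(a, b) {
  a <- trimws(as.character(a))
  b <- trimws(as.character(b))
  # adist() returns a matrix; [1, 1] extracts the single pairwise distance
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  # Assumption: at least one string in each pair is non-empty,
  # otherwise this divides by zero and yields NaN
  d / pmax(nchar(a), nchar(b))
}

# Same sentences as in the question, rebuilt here so the block runs on its own
compare <- data.frame(
  person = c("person1", "person2"),
  text1  = c("the quick brown fox jumps over the lazy dog",
             "I dont want to work today"),
  text2  = c("the quick cat jumps on the fast fog",
             "I feel like working today"),
  stringsAsFactors = FALSE
)

compare$diff <- norm_adist(compare$text1, compare$text2)
```

Whether to scale by the longer string, the shorter one, or the sum of both lengths is a judgment call; the longer string is used here because it keeps the value at or below 1.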
0

I had to learn a bit of text mining. Using tm, I wrote a function that compares two sentences or paragraphs and returns a numeric value.

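library(tm)

dis <- function(text1, text2) {
  # create a two-document corpus; as.character() guards against factor columns
  text_c <- c(as.character(text1), as.character(text2))
  myCorpus <- Corpus(VectorSource(text_c))
  # create a term-document matrix without punctuation and stop words
  tdmc <- TermDocumentMatrix(myCorpus,
                             control = list(removePunctuation = TRUE, stopwords = TRUE))
  # compute the cosine dissimilarity between the two documents
  # (dissimilarity() comes from the tm version in use when this was written)
  return(dissimilarity(tdmc, method = "cosine"))
}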

compare$dis <- mapply(dis, compare$text1, compare$text2) 


person   text1                                        text2                                dis
person1  the quick brown fox jumps over the lazy dog  the quick cat jumps on the fast fog  0.63
person2  I dont want to work today                    I feel like working today            0.75
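
As far as I know, dissimilarity() is no longer provided by later tm releases, so the function above may not run on a current installation. The same idea (count the words in each text, then take one minus the cosine of the angle between the two count vectors) can be sketched in base R. This is an illustration under stated assumptions, not a drop-in replacement: the name cosine_dis is made up, and because it does not remove stop words or short words the way the tm control options do, its values will not match the 0.63 and 0.75 shown above.

```r
# Hypothetical helper: cosine dissimilarity between the word counts of two texts.
# 0 means the same word distribution, 1 means no words in common.
cosine_dis <- function(text1, text2) {
  tokenize <- function(x) {
    x <- tolower(as.character(x))
    x <- gsub("[[:punct:]]", "", x)                 # drop punctuation
    words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
    words[nzchar(words)]                            # drop empty tokens
  }
  w1 <- tokenize(text1)
  w2 <- tokenize(text2)
  vocab <- union(w1, w2)
  # Count each vocabulary word in both texts (0 when a word is absent)
  v1 <- as.numeric(table(factor(w1, levels = vocab)))
  v2 <- as.numeric(table(factor(w2, levels = vocab)))
  # Assumption: both texts contain at least one word, otherwise the result is NaN
  1 - sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

# Same sentences as in the question, rebuilt here so the block runs on its own
text1 <- c("the quick brown fox jumps over the lazy dog",
           "I dont want to work today")
text2 <- c("the quick cat jumps on the fast fog",
           "I feel like working today")

dis_values <- mapply(cosine_dis, text1, text2, USE.NAMES = FALSE)
```

Unlike the edit distance, this measure ignores word order and looks only at which words appear and how often, which tends to suit paragraphs better than short strings.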