2016-09-21

I have many text sentences that I want to compare. Using the Little Red Riding Hood example below, how can I count the words each pair of sentences has in common and store the results in a matrix?

text1 <- "Once upon a time" 
text2 <- "there was a dear little girl" 
text3 <- "who was loved by everyone who looked at her" 

I can count the common words pair by pair like this:

text1_split <- unlist(strsplit(text1, " ")) 
text2_split <- unlist(strsplit(text2, " ")) 
text3_split <- unlist(strsplit(text3, " ")) 

length(intersect(text1_split, text2_split)) 
length(intersect(text2_split, text3_split)) 

texts <- c("text1","text2","text3") 
data <- data.frame(texts) 
data[, texts] <- NA 
rownames(data) <- texts 
data <- data[,-1] 

data[1,1] <- length(intersect(text1_split, text1_split)) 
data[1,2] <- length(intersect(text1_split, text2_split)) 
data[1,3] <- length(intersect(text1_split, text3_split)) 

The result, stored in a matrix, looks like this:

      text1 text2 text3
text1     4     1     0
text2    NA    NA    NA
text3    NA    NA    NA
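For completeness, the manual pattern above can be finished with a plain double loop over every pair of sentences (a minimal sketch of the naive approach; the `sentences` and `splits` names are mine):

```r
# Naive completion of the manual approach: loop over every pair of
# sentences and count the shared (unique) words.
sentences <- c("Once upon a time",
               "there was a dear little girl",
               "who was loved by everyone who looked at her")
splits <- lapply(sentences, function(x) strsplit(x, " ")[[1]])
n <- length(splits)
data <- matrix(NA_integer_, n, n,
               dimnames = list(paste0("text", 1:n), paste0("text", 1:n)))
for (i in 1:n)
    for (j in 1:n)
        data[i, j] <- length(intersect(splits[[i]], splits[[j]]))
data
```

This performs n^2 intersections, which is what the answers speed up.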

Is there an efficient way to complete the matrix? I have more than 100 sentences to compare. This is similar to, but not the same as, this post: Count common words in two strings in R

Answers


Try this:

CommonWordsMatrixOld <- function(vList) {
    # Lowercase each sentence, then for every pair (x, y) count the
    # unique words they share; rbind the rows into a matrix.
    v <- lapply(vList, tolower)
    do.call(rbind, lapply(v, function(x) {
        xSplit <- strsplit(x, " ")[[1]]
        do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
    }))
}

myText <- list(text1, text2, text3) 

Calling it on the three sentences, we have:

CommonWordsMatrixOld(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8

It is decently fast for the data size the OP is requesting. The data was obtained here:

testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE) 

set.seed(1111) 
myTestText <- lapply(1:100, function(x) {
    paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE), ], collapse = " ")
})

myTestText[[15]] 
[1] "access restaurant video opinion video eventually fresh eventually 
reform credit publish judge Senate publish fresh restaurant publish 
version Senate critical release recall relation version" 

system.time(test1 <- CommonWordsMatrixOld(myTestText))
   user  system elapsed 
  0.625   0.009   0.646 

Here is the output:

test1[1:10, 1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    9    3    5    1    3    4    4    2    2     1
 [2,]    3    5    3    1    1    3    3    0    0     1
 [3,]    5    3   12    0    3    8    4    3    2     1
 [4,]    1    1    0    1    0    0    1    0    0     0
 [5,]    3    1    3    0    4    2    1    1    1     0
 [6,]    4    3    8    0    2   13    7    4    1     1
 [7,]    4    3    4    1    1    7   10    4    1     1
 [8,]    2    0    3    0    1    4    4    7    3     0
 [9,]    2    0    2    0    1    1    1    3    4     0
[10,]    1    1    1    0    0    1    1    0    0     2

Update

Here is a faster algorithm that cuts out many unnecessary operations and takes advantage of lower.tri, while still remaining very general.

CommonWordsMatrixNew <- function(vList) {
    # Split and lowercase each sentence only once (n calls to strsplit).
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    # Compute only the upper triangle (y runs from x to s)...
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L, (x - 1L)),
          do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    # ...then mirror it into the lower triangle, since the matrix is symmetric.
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}
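As a quick sanity check on the question's three sentences (the definition is restated so this chunk runs on its own), the new function agrees with the output of CommonWordsMatrixOld shown earlier:

```r
CommonWordsMatrixNew <- function(vList) {
    # Split/lowercase once, fill the upper triangle, mirror the rest.
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L, (x - 1L)),
          do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}

myText <- list("Once upon a time",
               "there was a dear little girl",
               "who was loved by everyone who looked at her")
CommonWordsMatrixNew(myText)
```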

To give you an idea of the performance gains, here are some benchmarks. (It should be noted that the OP's solution does not split the vectors itself, so it is not a true comparison.) The new algorithm is almost twice as fast as the OP's solution.

library(microbenchmark)
microbenchmark(New = CommonWordsMatrixNew(myTestText),
               Old = CommonWordsMatrixOld(myTestText),
               Pach = CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval
  New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10
  Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10
 Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10

identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText))
[1] TRUE
identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE

The new algorithm reduces the number of calls to strsplit by n^2 - n (e.g., in the example above, strsplit is called 10000 times in the original algorithm and only 100 times in the updated version). Additionally, since the resulting matrix is symmetric, there is no need to compute the interaction between any pair of sentences more than once, hence x = 1:s and y = x:s in the lapply functions. The number of computations for these loops drops from n^2 to the nth triangle number = n*(n+1)/2 (e.g., in our example, from 10000 to 5050). After that, we rely on the power of indexing in R, which is generally much faster than construction done by hand.
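The symmetric-fill trick can be seen in isolation with a tiny sketch (the values are hard-coded from the 3x3 example matrix above):

```r
# Compute only the upper triangle (hard-coded here), then mirror it
# into the lower triangle instead of recomputing each intersection.
m <- matrix(0L, 3, 3)
m[upper.tri(m, diag = TRUE)] <- c(4L, 1L, 6L, 0L, 1L, 8L)  # filled column by column
m[lower.tri(m)] <- t(m)[lower.tri(m)]
m
```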

Thanks a lot. Below I have posted something based on your reply that seems faster – pachamaltese

@pachamaltese, the first algorithm was doing unnecessary computations. I modified my original algorithm to remove many operations. Also, the algorithms above are still general (i.e., they do not depend on pre-split vectors). By the way, good question. –

Good point! I had already pre-split with strsplit before the function itself – pachamaltese


I found that splitting beforehand improves the speed, so that

CommonWordsMatrix <- function(vList) {
    v <- lapply(vList, tolower)
    do.call(rbind, lapply(v, function(x) {
        do.call(c, lapply(v, function(y) length(intersect(x, y))))
    }))
}

is a nice option to go with (x and y being pre-split vectors of words).
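A hypothetical end-to-end sketch of this pre-split variant (the sentence list and the `PreSplit` name are mine; the function is the one above):

```r
CommonWordsMatrix <- function(vList) {
    # vList is already a list of word vectors, so no strsplit here.
    v <- lapply(vList, tolower)
    do.call(rbind, lapply(v, function(x) {
        do.call(c, lapply(v, function(y) length(intersect(x, y))))
    }))
}

# Split every sentence into words once, up front...
PreSplit <- lapply(list("Once upon a time",
                        "there was a dear little girl",
                        "who was loved by everyone who looked at her"),
                   function(x) strsplit(x, " ")[[1]])
# ...then pass the list of word vectors straight in.
CommonWordsMatrix(PreSplit)
```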