Can I vectorize this function further?

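I am fairly new to R and to matrix-based scripting languages. I have written this function to return the index of every row whose content is similar to the content of another row. It is a primitive form of spam reduction that I am developing.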

if (!require("RecordLinkage")) install.packages("RecordLinkage") 

library("RecordLinkage") 

# Takes a column of strings, returns a vector of indices
check_similarity <- function(x) {
    threshold <- 0.8
    values <- NULL
    for (i in seq_along(x)) {
        # positions of the other strings, so hits refer to x rather than x[-i]
        others <- seq_along(x)[-i]
        values <- c(values, others[jarowinkler(x[i], x[-i]) > threshold])
    }
    return(values)
}

Is there a way to write this that avoids the explicit for loop?

Source

2017-02-14 user2228313

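@akrun Updated, cheers. – user2228313

@Db No, I am comparing each row against all the other rows: x[i] against x[-i]. – user2228313

Maybe try this: 'm = as.matrix(sapply(x, jarowinkler, x)) > threshold; diag(m) = 0; which(rowSums(m) > 0)'. I have no reproducible data to test with, but I think this works. – dww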

We can use sapply to simplify the code.

# some test data
x = c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')
# same cutoff as in the question
threshold = 0.8

# create a length(x) by length(x) matrix specifying which strings are alike
m = sapply(x, jarowinkler, x) > threshold

# set diagonal to FALSE: we're not interested in strings being identical to themselves 
diag(m) = FALSE 

# And find index positions of all strings that are similar to at least one other string 
which(rowSums(m) > 0) 
# [1] 1 2 4 5

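That is, this returns the index positions of 'hello', 'hollow', 'turtle' and 'bottle', since each of them is similar to at least one other string.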
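The steps above can also be wrapped into a function with the same shape as check_similarity from the question. This is only a sketch: the name similar_indices and the sim argument are made up for illustration, and sim is assumed to take a single string plus a character vector and return one score per element, which is how jarowinkler is called above.

```r
# Hypothetical helper (not part of RecordLinkage): positions of the strings in x
# that score above `threshold` against at least one other string.
# `sim` is assumed to take (single string, character vector) and return one
# numeric score per element; the default is only evaluated when no `sim` is given.
similar_indices <- function(x, threshold = 0.8, sim = RecordLinkage::jarowinkler) {
  # vapply checks that every string yields exactly length(x) numeric scores
  scores <- vapply(x, function(s) sim(s, x), numeric(length(x)), USE.NAMES = FALSE)
  m <- matrix(scores > threshold, nrow = length(x))
  diag(m) <- FALSE   # ignore each string's match with itself
  which(rowSums(m) > 0)
}
```

With RecordLinkage installed, similar_indices(x) should follow the same steps as the snippet above. Passing a different sim makes it possible to try another similarity measure without touching the rest of the function.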

If you prefer, you can use colSums instead of rowSums to get a named vector, but this could get messy if the strings are long:

which(colSums(m) > 0) 
#  hello hollow turtle bottle
#      1      2      4      5
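If it is also useful to know which strings matched each other, and not only that a match exists, the same kind of logical matrix can be unpacked with base R's which(..., arr.ind = TRUE). A minimal sketch follows. To keep it runnable without RecordLinkage it uses a small hand-written matrix in place of the jarowinkler result, so the values in m are illustrative and not computed scores; the names hits and matches are likewise made up.

```r
# Illustrative data only: a hand-made logical matrix standing in for
# `sapply(x, jarowinkler, x) > threshold`.
x <- c("hello", "hollow", "cat")
m <- matrix(FALSE, nrow = 3, ncol = 3)
m[1, 2] <- m[2, 1] <- TRUE   # pretend 'hello' and 'hollow' are alike

# clear the diagonal and lower triangle so each pair is reported once
m[lower.tri(m, diag = TRUE)] <- FALSE
hits <- which(m, arr.ind = TRUE)   # one row per TRUE cell, columns "row" and "col"

# translate row/column positions back into the strings themselves
matches <- data.frame(first = x[hits[, "row"]], second = x[hits[, "col"]])
```

Each row of matches then names one pair of similar strings, which may be easier to review than a bare vector of indices.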

Source

2017-02-14 22:50:18 dww
