Simple matching similarity matrix for continuous, non-binary data?

Given the matrix:

structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L, 
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L, 
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1", 
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA, 
-6L))

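I would like to create a 5 x 5 distance matrix whose entries are the ratio of matching rows to the total number of rows, for every pair of columns. For example, the distance between X4 and X3 should be 0.5, because the two columns match in 3 out of 6 rows.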
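To make the target concrete, here is a minimal sketch of the calculation for a single pair of columns. It assumes the data frame above is stored in a variable called test, the name used later in this thread:

```r
# Assumption: the data frame from the question is stored as `test`.
test <- data.frame(
  X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
  X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
  X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
  X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
  X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
)

# Row-wise comparison of two columns: TRUE where the values match.
matches <- test$X3 == test$X4

# Ratio of matching rows to the total number of rows.
ratio_x3_x4 <- sum(matches) / nrow(test)
ratio_x3_x4
# [1] 0.5
```

The full matrix is this same ratio computed for every pair of columns.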

I have already tried dist(test, method="simple matching") from the package "proxy", but that method only works for binary data.


2012-05-24 Werner

Use outer (again :-)

my.dist <- function(x) {
  n <- nrow(x)
  # for every pair of column indices (i, j), compute the share of matching rows
  d <- outer(seq.int(ncol(x)), seq.int(ncol(x)),
             Vectorize(function(i, j) sum(x[[i]] == x[[j]]) / n))
  rownames(d) <- names(x)
  colnames(d) <- names(x)
  return(d)
}

my.dist(x) 
#           X1        X2  X3  X4        X5
# X1 1.0000000 0.0000000 0.0 0.0 0.3333333
# X2 0.0000000 1.0000000 0.5 0.5 0.1666667
# X3 0.0000000 0.5000000 1.0 0.5 0.0000000
# X4 0.0000000 0.5000000 0.5 1.0 0.0000000
# X5 0.3333333 0.1666667 0.0 0.0 1.0000000


2012-05-24 04:16:42 flodel

Thanks again! This works great. – Werner

Here's a shot at it (dt is your matrix):

library(reshape) 
df = expand.grid(names(dt),names(dt)) 
df$val=apply(df,1,function(x) mean(dt[x[1]]==dt[x[2]])) 
cast(df,Var2~Var1)


2012-05-24 04:11:12 blindjesse

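This works great! Thank you very much. Just one mistake: in line 3, df2 should be df. – Werner

Here's a solution that is faster than the other two, though a bit ugly. I assume the speed bump comes from not using mean(), since it can be slower than sum(), and from computing only half of the output matrix and then filling in the lower triangle manually. The function currently leaves NA on the diagonal, but you can easily set those to 1, to match the other answers exactly, with diag(out) <- 1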

FUN <- function(m) { 
    #compute all the combinations of columns pairs 
    combos <- t(combn(ncol(m),2)) 
    #compute the similarity index based on the criteria defined 
    sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0)/nrow(m)) 
    combos <- cbind(combos, sim) 
    #dimensions of output matrix 
    out <- matrix(NA, ncol = ncol(m), nrow = ncol(m)) 

    for (i in 1:nrow(combos)) {
        #upper tri
        out[combos[i, 1], combos[i, 2]] <- combos[i, 3]
        #lower tri
        out[combos[i, 2], combos[i, 1]] <- combos[i, 3]
    }
    return(out) 
}
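As a rough sketch of the same upper-triangle idea, the loop can be swapped for matrix indexing, with the diagonal set to 1 and the column names carried over. The function name match_ratio_matrix is made up for this illustration; it is a variant and not the function timed below:

```r
# Hypothetical helper: a variant of the upper-triangle approach above.
match_ratio_matrix <- function(m) {
  m <- as.matrix(m)
  p <- ncol(m)
  # one row per pair of column indices (i < j)
  pairs <- t(combn(p, 2))
  # share of rows in which the two columns hold the same value
  sim <- apply(pairs, 1, function(ij) sum(m[, ij[1]] == m[, ij[2]]) / nrow(m))
  out <- matrix(NA_real_, p, p, dimnames = list(colnames(m), colnames(m)))
  out[pairs] <- sim                       # fill the upper triangle
  out[pairs[, 2:1, drop = FALSE]] <- sim  # mirror into the lower triangle
  diag(out) <- 1                          # every column matches itself
  out
}

# Data from the question
m <- cbind(X1 = c(1, 2, 3, 4, 2, 5), X2 = c(2, 3, 4, 5, 3, 6),
           X3 = c(3, 4, 4, 5, 3, 2), X4 = c(2, 4, 6, 5, 3, 8),
           X5 = c(1, 3, 2, 4, 6, 4))
res <- match_ratio_matrix(m)
res["X3", "X4"]
# [1] 0.5
```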

I wrapped the other two answers into functions and ran some benchmarks:

library(rbenchmark) 
benchmark(chase(m), flodel(m), blindJessie(m), 
      replications = 1000, 
      order = "elapsed", 
      columns = c("test", "elapsed", "relative")) 
#-----
            test elapsed  relative
1       chase(m)   1.217  1.000000
2      flodel(m)   1.306  1.073131
3 blindJessie(m)  17.691 14.548520


2012-05-24 04:35:54 Chase

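Chase, there is a bug in your code: you can't use 'combos' after 'transform(combos, ...)', because it will be evaluated inside 'combos'. I suspect you have another copy of 'combos' in your global environment, which is why it works for you. It should be an easy fix: make a copy of combos before calling 'transform'. – flodel

@flodel - Good catch, thanks. Made the appropriate adjustments and re-ran the timings. Sticking with matrices and cbind also sped up the function. – Chase

Then you may want to run them again, since I also improved the speed of my answer. On my machine, my version is a little slower than yours, but not by much: the ratio is down to 1.07. – flodel

Thank you all for your suggestions. Based on your answers, I put together a three-line solution ("test" is the name of the dataset).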

require(proxy) 
ff <- function(x,y) sum(x == y)/NROW(x) 
dist(t(test), ff, upper=TRUE)

Output:

          X1        X2        X3        X4        X5
X1           0.0000000 0.0000000 0.0000000 0.3333333
X2 0.0000000           0.5000000 0.5000000 0.1666667
X3 0.0000000 0.5000000           0.5000000 0.0000000
X4 0.0000000 0.5000000 0.5000000           0.0000000
X5 0.3333333 0.1666667 0.0000000 0.0000000


2012-05-25 02:49:35 Werner

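I can't get this to work: 'ff' is not defined... and even when I change it to 'f', it fails with 'Error in as.character(x) : cannot coerce type 'closure' to vector of type 'character''. – Chase

I think that is because the 'dist' function I am using is the one from the proxy package. I will add 'require(proxy)' to the code. – Werner

I got the answer as follows. First, I made some modifications to the raw data: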

X1 = c(1L, 2L, 3L, 4L, 2L, 5L)
X2 = c(2L, 3L, 4L, 5L, 3L, 6L)
X3 = c(3L, 4L, 4L, 5L, 3L, 2L)
X4 = c(2L, 4L, 6L, 5L, 3L, 8L)
X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
# R is case-sensitive: the vectors are X1..X5, not x1..x5
matrix_cor = rbind(X1, X2, X3, X4, X5)
matrix_cor

   [,1] [,2] [,3] [,4] [,5] [,6]
X1    1    2    3    4    2    5
X2    2    3    4    5    3    6
X3    3    4    4    5    3    2
X4    2    4    6    5    3    8
X5    1    3    2    4    6    4

Then:

dist(matrix_cor) 

         X1       X2       X3       X4
X2 2.449490
X3 4.472136 4.242641
X4 5.000000 3.000000 6.403124
X5 4.358899 4.358899 4.795832 6.633250
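
Note that base R's dist() uses Euclidean distance by default, so the numbers above are Euclidean distances between the rows and not match ratios. A minimal sketch that checks one entry by hand and compares it with the matching ratio asked for in the question (the variable names euclid_x1_x2 and match_x1_x2 are made up for this illustration):

```r
X1 <- c(1L, 2L, 3L, 4L, 2L, 5L)
X2 <- c(2L, 3L, 4L, 5L, 3L, 6L)

# Euclidean distance, the default method of stats::dist()
euclid_x1_x2 <- sqrt(sum((X1 - X2)^2))
euclid_x1_x2
# [1] 2.44949

# Matching ratio from the question: share of positions with equal values
match_x1_x2 <- mean(X1 == X2)
match_x1_x2
# [1] 0
```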


2017-02-18 14:38:16

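Hi. Thanks for your answer: I edited it so that the code is readable. In the future, please format your answers so they are easier to read (http://stackoverflow.com/editing-help) – lbusett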
