2012-11-21 113 views
2

Possible duplicate:
Find cosine similarity in R

Calculating cosine similarity when the data contains NA values

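I have a large table like the one below in R, and I want to find the cosine similarity between every pair of items, e.g. the pairs (91,93), (91,99), (91,100) ... (101,125). The final output should be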

No_1 No_2 Similarity 
... 
6518 6763 0.974 
… 

The table looks like this.

 No_ Product.Group.Code R1 R2 R3 R4 S1 S2 S3 U1 U2 U3 U4 U6 
91 65418    164 0.68 0.70 0.50 0.59 NA NA 0.96 NA 0.68 NA NA NA 
93 57142    164 NA 0.94 NA NA 0.83 NA NA 0.54 NA NA NA NA 
99 66740    164 0.68 0.68 0.74 NA 0.63 0.68 0.72 NA NA NA NA NA 
100 76712    164 0.54 0.54 0.40 NA 0.39 0.39 0.39 0.50 NA 0.50 NA NA 
101 56463    164 0.67 0.67 0.76 NA NA 0.76 0.76 0.54 NA NA NA NA 
125 11713    164 NA NA NA NA NA 0.88 NA NA NA NA NA NA 

Because some rows contain NA, I wrote a few helper functions that compare only the columns in which both rows are non-NA.

compareNA <- function(v1,v2) { 
    same <- (!is.na(v1) & !is.na(v2)) 
    same[is.na(same)] <- FALSE 
    return(same) 
} 

selectTRUE <- function(v1, truth) { 
    # This function selects only the variables which correspond to the truth vector 
    # being true. 
    for (colname in colnames(v1)) { 
     if(!truth[ ,colname]) { 
      v1[colname] <- NULL 
     } 
    } 
    return(v1) 
} 

trimAndTuck <- function(v1){ 
    # Turns list into vector and removes first two columns 
    return (unlist(v1, use.names = FALSE)[-(1:2)]) 
} 

cosineSimilarity <- function(v1, v2) { 
    truth <- compareNA(v1, v2) 
    return (cosine(
       trimAndTuck(selectTRUE(v1, truth)), 
       trimAndTuck(selectTRUE(v2, truth)) 
       )) 
} 

allPairs <- function(df){ 
    for (i in 1:length(df)) { 
     for (j in 1:length(df)) { 
      print(cosineSimilarity(df[i,], df[j,])) 
     } 
    } 
} 

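Running allPairs does not give me the right answer; it prints a series of 1x1 vectors instead. I am well aware that what I have written is probably an insult to the functional gods, but I don't know how else to write it.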

How can I rewrite (vectorize?) this so that it returns the data in the correct format?

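Edit: The cosine function I am using is part of the lsa package. This question is about handling NA values with the cosine function, not about how to compute a standard cosine similarity.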
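A note on the `allPairs` loop above: one likely contributor to the unexpected output is that `length()` on a data frame counts columns, not rows, so `1:length(df)` walks over column positions. The results are also only printed, never collected, and each one is a 1x1 matrix because `crossprod()` returns a matrix. The following is a minimal sketch on a made-up data frame (`toy` and `pairs` are illustrative names, not part of the original code) showing the difference and one way to enumerate each pair of rows exactly once.

```r
# Hypothetical toy data: 3 rows, 5 columns
toy <- data.frame(id = 1:3,
                  a  = c(0.1, NA, 0.3),
                  b  = c(0.5, 0.6, NA),
                  c  = c(NA, 0.2, 0.4),
                  d  = c(0.9, 0.8, 0.7))

length(toy)  # number of columns of the data frame
nrow(toy)    # number of rows, which is what a row-wise loop needs

# Each distinct pair of row indices exactly once (one pair per row of `pairs`)
pairs <- t(combn(seq_len(nrow(toy)), 2))
pairs
```

Looping over the rows of `pairs` and storing each similarity in a vector, rather than printing it, would give results that can be bound into a table afterwards.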

+1

Which R package is the cosine function from? –

+1

Maybe this [so-question](http://stackoverflow.com/questions/2535234/find-cosine-similarity-in-r) (possible duplicate) will help. Follow the instructions in the accepted answer. – sgibb

+0

This is not a duplicate, because it is about how to handle NA values when using the cosine function. – Roland

Answers

3
#data 
df <- read.table(text="No_ Product.Group.Code R1 R2 R3 R4 S1 S2 S3 U1 U2 U3 U4 U6 
91 65418    164 0.68 0.70 0.50 0.59 NA NA 0.96 NA 0.68 NA NA NA 
93 57142    164 NA 0.94 NA NA 0.83 NA NA 0.54 NA NA NA NA 
99 66740    164 0.68 0.68 0.74 NA 0.63 0.68 0.72 NA NA NA NA NA 
100 76712    164 0.54 0.54 0.40 NA 0.39 0.39 0.39 0.50 NA 0.50 NA NA 
101 56463    164 0.67 0.67 0.76 NA NA 0.76 0.76 0.54 NA NA NA NA 
125 11713    164 NA NA NA NA NA 0.88 NA NA NA NA NA NA",header=TRUE) 

#remove second column 
df <- df[,-2] 

#transform to long format 
library(reshape2) 
df <- melt(df,id.vars="No_") 

#cosine similarity taken from package lsa 
#I could not load package lsa, because I lack Java on my system 
cosine <- function(x, y=NULL) { 

    if (is.matrix(x) && is.null(y)) { 

        co = array(0,c(ncol(x),ncol(x))) 
        f = colnames(x) 
        dimnames(co) = list(f,f) 

        for (i in 2:ncol(x)) { 
            for (j in 1:(i-1)) { 
                co[i,j] = cosine(x[,i], x[,j]) 
            } 
        } 
        co = co + t(co) 
        diag(co) = 1 

        return (as.matrix(co)) 

    } else if (is.vector(x) && is.vector(y)) { 
        return (crossprod(x,y)/sqrt(crossprod(x)*crossprod(y))) 
    } else { 
        stop("argument mismatch. Either one matrix or two vectors needed as input.") 
    } 

} 

#define function that removes NA before calculating the similarity 
cosine2 <- function(x,y) cosine(na.omit(cbind(x,y))) 

#pairwise comparisons 
i <- outer(unique(df$No_),unique(df$No_),FUN=function(i,j) i) 
j <- outer(unique(df$No_),unique(df$No_),FUN=function(i,j) j) 

i <- i[!lower.tri(i)] 
j <- j[!lower.tri(j)] 

comp <- function(ind) { 
    res <- cosine2(df$value[df$No_==i[ind]],df$value[df$No_==j[ind]])[1,2] 
    list(No1=as.character(i[ind]),No2=as.character(j[ind]),CosSim=res) 
} 

res <- as.data.frame(t(sapply(seq_along(i),FUN="comp"))) 

    No1 No2 CosSim 
1 65418 65418   1 
2 65418 57142   1 
3 57142 57142   1 
4 65418 66740 0.9724159 
5 57142 66740 0.999714 
6 66740 66740   1 
7 65418 76712 0.9569313 
8 57142 76712 0.9684678 
9 66740 76712 0.9854669 
10 76712 76712   1 
11 65418 56463 0.9741412 
12 57142 56463 0.9877108 
13 66740 56463 0.9989167 
14 76712 56463 0.9708716 
15 56463 56463   1 
16 65418 11713  NaN 
17 57142 11713  NaN 
18 66740 11713   1 
19 76712 11713   1 
20 56463 11713   1 
21 11713 11713   1
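
For comparison, here is a minimal base-R sketch of the same idea that works directly on the wide table and returns columns named as in the question (`No_1`, `No_2`, `Similarity`). The helper names `cosine_complete` and `pairwise_cosine` and the toy data are assumptions made for illustration; they are not from the lsa package or from the code above. Unlike the output above, this sketch skips self-pairs and returns `NA` rather than `NaN` when two rows share no non-NA column.

```r
# Hypothetical helper: cosine similarity over the positions where
# both vectors are non-NA.
cosine_complete <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  if (!any(ok)) return(NA_real_)  # no shared columns: similarity undefined
  sum(x[ok] * y[ok]) / sqrt(sum(x[ok]^2) * sum(y[ok]^2))
}

# Hypothetical helper: one output row per distinct pair of rows of `wide`.
pairwise_cosine <- function(wide, id_col, value_cols) {
  m   <- as.matrix(wide[, value_cols, drop = FALSE])
  idx <- combn(nrow(m), 2)        # 2 x n_pairs matrix of row indices
  data.frame(
    No_1       = wide[[id_col]][idx[1, ]],
    No_2       = wide[[id_col]][idx[2, ]],
    Similarity = apply(idx, 2, function(p) cosine_complete(m[p[1], ], m[p[2], ]))
  )
}

# Small made-up example (not the data from the question)
toy <- data.frame(No_ = c(1, 2, 3),
                  R1  = c(1, 2, NA),
                  R2  = c(2, 4, NA),
                  S1  = c(NA, 1, 3))
out <- pairwise_cosine(toy, "No_", c("R1", "R2", "S1"))
print(out)
```

Note that when two rows share only a single non-NA column, the similarity is necessarily 1 for positive values, which is why several pairs involving 11713 in the output above come out as exactly 1. Depending on the application, it may be worth requiring a minimum number of shared columns before reporting a similarity.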