2014-01-10
1

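Converting strings to a similarity matrix

I have a number of strings in a special format that represent sets. In R, I would like to convert them into similarity matrices.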

For example, the following string says that 1 and 2 form one set, 3 is a set on its own, and 4, 5 and 6 form one set:

"1+2,3,4+5+6" 

For the example above, I would like to be able to produce

 [,1] [,2] [,3] [,4] [,5] [,6] 
[1,] 1 1 0 0 0 0 
[2,] 1 1 0 0 0 0 
[3,] 0 0 1 0 0 0 
[4,] 0 0 0 1 1 1 
[5,] 0 0 0 1 1 1 
[6,] 0 0 0 1 1 1 

It seems like this should be a painfully simple task. How would I go about it?

Answers

5

Here is one approach:

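# split into sets at ",", then split each set into its numeric members at "+"
out <- lapply(unlist(strsplit("1+2,3,4+5+6", ",")), function(x) { 
    as.numeric(unlist(strsplit(x, "\\+"))) 
}) 

# membership table: one row per element, one column per set
x <- table(unlist(out), rep(seq_along(out), sapply(out, length))) 

# multiplying by the transpose gives 1 where two elements share a set
matrix(x %*% t(x), nrow(x)) 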

##  [,1] [,2] [,3] [,4] [,5] [,6] 
## [1,] 1 1 0 0 0 0 
## [2,] 1 1 0 0 0 0 
## [3,] 0 0 1 0 0 0 
## [4,] 0 0 0 1 1 1 
## [5,] 0 0 0 1 1 1 
## [6,] 0 0 0 1 1 1 
+1

I edited it to be faster (i.e., removed an unnecessary `data.frame` conversion) –

2

Pseudocode:

Split at , to get an array of strings, each describing a set. 
For each element of the array: 
    Split at + to get an array of set members 
    Mark every possible pairing of members of this set on the matrix 

In R you can create a matrix:

m = mat.or.vec(6, 6) 

By default, the matrix is initialized with all entries set to 0. You can assign new values with

m[2,3] = 1 
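
The pseudocode and the two matrix operations above could be combined roughly as follows. This is a minimal sketch, not the answerer's own code: the function name `mark_sets` and its `n` argument (the matrix size, supplied by the caller) are assumptions made for illustration.

```r
# Hypothetical helper following the pseudocode; 'mark_sets' and 'n' are assumed names
mark_sets <- function(string, n) {
  m <- matrix(0, n, n)                       # n x n matrix of zeros
  sets <- strsplit(string, ",")[[1]]         # split at "," to get one string per set
  for (s in sets) {
    # split at "+" to get the members of this set
    members <- as.integer(strsplit(s, "+", fixed = TRUE)[[1]])
    # indexing rows and columns by the same vector marks every pairing at once
    m[members, members] <- 1
  }
  m
}

m <- mark_sets("1+2,3,4+5+6", 6)
```

Assigning to `m[members, members]` covers every pairing of members, including each member with itself, so no inner loop over pairs is needed.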
+0

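Right. And while I think I have a handle on the first three lines of the pseudocode, actually marking the matrix is the part I can't wrap my head around. – Oreotrephes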

+0

@Oreotrephes Hopefully this explains it better. – Superbest

+0

Sure, thanks - I'll try my hand with these components. – Oreotrephes

1

Here is another approach:

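# write a simple function 
similarity <- function(string){ 
    # "1+2" becomes "1:2", so each set is read as a run of consecutive integers
    sets <- gsub("\\+", ":", strsplit(string, ",")[[1]]) 
    # the matrix size is taken from the last character of the string
    n <- as.numeric(tail(strsplit(gsub("[[:punct:]]", "", string), "")[[1]], 1)) 
    mat <- mat.or.vec(n, n) 
    # evaluate each "a:b" expression to get the indices of that set
    ind <- suppressWarnings(lapply(sets, function(x) eval(parse(text=x)))) 

    for(i in seq_along(ind)){ 
        mat[ind[[i]], ind[[i]]] <- 1 
    } 

    return(mat) 
} 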

# Use that function 
> similarity("1+2,3,4+5+6") 
    [,1] [,2] [,3] [,4] [,5] [,6] 
[1,] 1 1 0 0 0 0 
[2,] 1 1 0 0 0 0 
[3,] 0 0 1 0 0 0 
[4,] 0 0 0 1 1 1 
[5,] 0 0 0 1 1 1 
[6,] 0 0 0 1 1 1 

# Using another string 
> similarity("1+2,3,5+6+7, 8") 
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] 
[1,] 1 1 0 0 0 0 0 0 
[2,] 1 1 0 0 0 0 0 0 
[3,] 0 0 1 0 0 0 0 0 
[4,] 0 0 0 0 0 0 0 0 
[5,] 0 0 0 0 1 1 1 0 
[6,] 0 0 0 0 1 1 1 0 
[7,] 0 0 0 0 1 1 1 0 
[8,] 0 0 0 0 0 0 0 1 
+0

Awesome! However, I'm not quite sure why, but it seems this doesn't work for out-of-sequence strings, e.g. "1+3,2" – Oreotrephes
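
The behaviour described in this comment appears to follow from two details of `similarity`: each `+` is replaced by `:`, so "1+3" is evaluated as `1:3` (that is, 1, 2, 3), and the matrix size is read from the last character of the string, which is 2 here rather than the largest label. A possible variant that avoids both is sketched below; the name `similarity2` is an assumption, and this is not the answerer's code.

```r
# Hypothetical variant of similarity(); 'similarity2' is an assumed name
similarity2 <- function(string) {
  # one numeric vector of members per set, e.g. c(1, 3) for "1+3"
  sets <- lapply(strsplit(strsplit(string, ",")[[1]], "+", fixed = TRUE), as.numeric)
  n <- max(unlist(sets))        # the largest label decides the matrix size
  mat <- matrix(0, n, n)
  for (s in sets) mat[s, s] <- 1
  mat
}

similarity2("1+3,2")
##      [,1] [,2] [,3]
## [1,]    1    0    1
## [2,]    0    1    0
## [3,]    1    0    1
```

Splitting at `+` keeps the members exactly as written, so sets no longer need to be consecutive and labels may have more than one digit.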