高效代碼映射中的R

您好基因型矩陣欲基因型的矩陣轉換，編碼爲三元組來編碼爲0，1，2的矩陣，即高效代碼映射中的R

c(0,0,1) <-> 0; c(0,1,0) <-> 1; c(0,0,1) <-> 2

首先下面是一些代碼，以生成需要減少的矩陣。

# generate genotypes 
expand.G = function(n,p){ 
    probs = runif(n = p) 
    G012.rows = matrix(rbinom(2,prob = probs,n=n*p),nrow = p) 
    colnames(G012.rows) = paste('s',1:n,sep = '') 
    rownames(G012.rows) = paste('g',1:p, sep = '') 
    G012.cols = t(G012.rows) 

    expand.geno = function(g){ 
    if(g == 0){return(c(1,0,0))} 
    if(g == 1){return(c(0,1,0))} 
    if(g == 2){return(c(0,0,1))} 
    } 

    gtype = c() 
    for(i in 1:length(c(G012.cols))){ 
    gtype = c(
     gtype, 
     expand.geno(c(G012.cols)[i]) 
    ) 
    } 

    length(gtype) 

    G = matrix(gtype,byrow = T, nrow = p) 
    colnames(G) = paste('s',rep(1:n,each = 3),c('1','2','3'),sep = '') 
    rownames(G) = paste('g',1:p, sep = '') 
    print(G[1:10,1:15]) 
    print(G012.rows[1:10,1:5]) 

    return(G) 
}

輸出有3n列和p行，其中n是樣本大小，p是基因型數量。現在，我們可以減少矩陣回0,1,2具有以下功能

reduce012 = function(x){ 
    if(identical(x, c(1,0,0))){ 
    return(0) 
    } else if(identical(x, c(0,1,0))){ 
    return(1) 
    } else if(identical(x, c(0,0,1))){ 
    return(2) 
    } else { 
    return(NA) 
    } 
} 

reduce.G = function(G.gen){ 
    G.vec = 
    mapply(function(i,j) reduce012(as.numeric(G.gen[i,(3*j-2):(3*j)])), 
      i=expand.grid(1:(ncol(G.gen)/3),1:nrow(G.gen))[,2], 
      j=expand.grid(1:(ncol(G.gen)/3),1:nrow(G.gen))[,1] 
    ) 

    G = matrix(G.vec, nrow = ncol(G.gen)/3, ncol = nrow(G.gen)) 
    colnames(G) = rownames(G.gen) 
    return(G) 
} 

reduce.G.loop = function(G.gen){ 
    G = matrix(NA,nrow = ncol(G.gen)/3, ncol = nrow(G.gen)) 
    for(i in 1:nrow(G.gen)){ 
    for(j in 1:(ncol(G.gen)/3)){ 
     G[j,i] = reduce012(as.numeric(G.gen[i,(3*j-2):(3*j)])) 
    } 
    } 
    colnames(G) = rownames(G.gen) 
    return(G) 
}

輸出由p列n行的編碼。這是偶然的，但有意識的是，編碼爲0,1,2的矩陣是編碼爲三元組的矩陣的轉置。

代碼不是特別快。困擾我的是時間是n^2。你能解釋還是提供更高效的代碼？

G = expand.G(1000,20) 
system.time(reduce.G(G)) 
system.time(reduce.G.loop(G)) 

G = expand.G(2000,20) 
system.time(reduce.G(G)) 
system.time(reduce.G.loop(G)) 

G = expand.G(4000,20) 
system.time(reduce.G(G)) 
system.time(reduce.G.loop(G))

來源

2017-09-01 user36302

我不能像'all.equal（x，c（1,0,0））'那樣運行你的代碼並不總是返回一個邏輯。你可以用'isTRUE（）'包裝它，但是請提供一個可重複的例子。 –

是的，這是以前的版本，我已更新到相同（）。我會編輯以反映上面。 – user36302

如果你覺得冒險，這裏有一些我寫的awk代碼：https：//github.com/vforget/gen2gemma/blob/master/gen2mgf.awk – Vince

你可以簡單地做一個訪問查找表：

decode <- array(dim = c(3, 3, 3)) 
decode[cbind(1, 0, 0) + 1] <- 0 
decode[cbind(0, 1, 0) + 1] <- 1 
decode[cbind(0, 0, 1) + 1] <- 2

然後，只是做：

matrix(decode[matrix(t(G + 1), ncol = 3, byrow = TRUE)], ncol = nrow(G))

這個全矢量化的R版本會給你相同的矩陣，沒有暗號和超級快。然而，如果你有更大的矩陣，你應該真的使用Rcpp來解決內存和時序問題。

來源

2017-09-01 20:46:25

這似乎是比你的版本（已更名爲reduce.G.orig）快三倍：

reduce.G <- function(G) { 
    varmap = c("100"=0, "010"=1, "001"=2) 
    result <- do.call(rbind, lapply(1:(ncol(G)/3)-1, function(val) 
    varmap[paste(G[,3*val+1], G[,3*val+2], G[,3*val+3], sep="")])) 
    colnames(result) <- rownames(G) 
    result 
} 

system.time(reduce.G(G)) 
# user system elapsed 
# 0.156 0.000 0.155 

system.time(reduce.G.orig(G)) 
# user system elapsed 
# 0.444 0.000 0.441 

identical(reduce.G(G), reduce.G.orig(G)) 
# [1] TRUE

來源

2017-09-01 19:38:06

高效代碼映射中的R

回答

相關問題