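Replacing values based on matches in two different columns in R

I have this matrix mymat (about 446664 × 234; it also contains other columns). It has REF and ALT columns, each of which can hold any one of the letters A, T, G, C (a single letter only). In the columns ending in .GT, I want to replace the digits with these letters. The rule is: where there is a 0, replace it with the letter from the REF column; where there is a 1, replace it with the letter from the ALT column. Where there is an NA, replace it with "0 0" (zero, space, zero). Finally, I need to transpose the .GT columns so that each one becomes a row, as shown in the result. In the result, everything is separated by spaces.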

mymat<-structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A", 
"1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA", 
"NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0", 
"NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp", 
"chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp", 
"chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp" 
), c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD" 
)))

Result

02688.GT A A A A C C 0 0 0 0 
02689.GT A A A C C C 0 0 0 0

Source

2015-09-21 MAPK

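If one column has a missing value, do all columns have a missing value there? – atiretoo

@atiretoo No, it is independent of the other columns, and any value can occur. – MAPK

So the rows in the result can have different lengths? – atiretoo

You can try: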

library(dplyr) 
library(stringi) 

## convert to data.frame 
data.frame(mymat, check.names = FALSE) %>% 
    ## replace the values ("0", "1", "/", "NA") in all columns ending with ".GT" with 
    ## the corresponding values in "REF" and "ALT" (" " for "/" and "0 0" for "NA") 
    mutate_each(funs(stri_replace_all(., REF, fixed = "0")), ends_with(".GT")) %>% 
    mutate_each(funs(stri_replace_all(., ALT, fixed = "1")), ends_with(".GT")) %>% 
    mutate_each(funs(stri_replace_all(., " ", fixed = "/")), ends_with(".GT")) %>% 
    mutate_each(funs(stri_replace_all(., "0 0", fixed = "NA")), ends_with(".GT")) %>% 
    ## keep only the columns ending with ".GT" 
    select(ends_with(".GT")) %>% 
    ## transpose the results 
    t()

which gives:

         [,1]  [,2]  [,3]  [,4]  [,5]
02688.GT "A A" "A A" "C C" "0 0" "0 0"
02689.GT "A A" "A C" "C C" "0 0" "0 0"

Source

2015-09-21 01:19:50

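Thanks, but is there a way to do this without converting it to a data frame (using only the matrix)? – MAPK

Could you explain the steps? – MAPK

@MAPK Yes, it can be done without converting to a data.frame, but this was the first solution that came to mind. Others are welcome to post different approaches. See the edit for an explanation of each step. –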
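
For the matrix-only variant asked about in the comments, one possible base R sketch is shown here. The helper name gt_to_letters is hypothetical and not from the original posts. It assumes genotypes are unphased pairs such as 0/1 that contain only 0 and 1, and that missing genotypes are stored either as the string "NA" or as a real NA.

```r
# Example matrix from the question
mymat <- structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A",
"1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA",
"NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0",
"NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp",
"chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp",
"chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp"
), c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD"
)))

# Hypothetical helper: convert every column whose name ends in ".GT"
# and return a character matrix with one row per .GT column.
gt_to_letters <- function(m) {
  gt_cols <- grep("\\.GT$", colnames(m), value = TRUE)
  # Assumption: only the indices 0 and 1 occur, so anything that is
  # not "0" is mapped to ALT.
  pick <- function(idx) ifelse(idx == "0", m[, "REF"], m[, "ALT"])
  rows <- lapply(gt_cols, function(col) {
    gt <- m[, col]
    missing <- is.na(gt) | gt == "NA"
    first  <- sub("/.*$", "", gt)   # allele index before the slash
    second <- sub("^.*/", "", gt)   # allele index after the slash
    out <- paste(pick(first), pick(second))
    out[missing] <- "0 0"           # missing genotype becomes "0 0"
    unname(out)
  })
  res <- do.call(rbind, rows)
  rownames(res) <- gt_cols
  res
}

res <- gt_to_letters(mymat)
# One space-separated line per .GT column, prefixed with the column name
lines <- paste(rownames(res), apply(res, 1, paste, collapse = " "))
writeLines(lines)
```

Each .GT column is processed with vectorised sub(), ifelse() and paste() calls, so the only explicit loop is over the .GT columns rather than over the rows of the matrix.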

I'm posting my own answer, but it is far too slow, so it needs further optimization.

# Sample IDs are assumed to be the prefixes of the ".GT" column names;
# all.samples was not defined in the original post.
all.samples <- sub("\\.GT$", "", grep("\\.GT$", colnames(mymat), value = TRUE))

# Pair up the REF and ALT letters of each variant so that a genotype digit
# (0 or 1) can be used as an index into the pair
alleles <- strsplit(paste(mymat[, "REF"], mymat[, "ALT"], sep = ","), ",")
# Keep only the .GT columns, transposed: one row per sample, one column per variant
values <- t(mymat[, colnames(mymat) %in% paste(all.samples, "GT", sep = "."), drop = FALSE])
nbval <- ncol(values) # number of variants

# Loop over the samples (rows of values) and transpose the result
mydata <- t(sapply(rownames(values), function(x) {
  indexes <- strsplit(values[x, ], "/") # list of index pairs, one per variant
  ret <- character(nbval)
  for (i in seq_len(nbval)) { # loop over the variants :/
    if (anyNA(indexes[[i]]) || identical(indexes[[i]], "NA")) {
      ret[i] <- "0 0" # missing genotype
    } else {
      # genotype digits are 0-based, R indexing is 1-based
      ret[i] <- paste(alleles[[i]][as.integer(indexes[[i]]) + 1], collapse = " ")
    }
  }
  ret # one "X Y" string per variant for this sample
}))

Source

2015-09-21 01:54:34 MAPK

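So this is only a partial answer, and I don't know how it will perform with more than 200000 rows. But maybe someone smarter will work out how to do this better.

temp1 <- strsplit(mymat[, 3], "/")  # split each genotype of the first .GT column into its alleles
reps <- lengths(temp1)              # 2 for a called genotype, 1 for "NA"
# one row per allele: the REF and ALT letters of its variant, plus the NA replacement
refalt <- data.frame(REF = rep(mymat[, 1], times = reps),
                     ALT = rep(mymat[, 2], times = reps),
                     ZERO = "0 0", stringsAsFactors = FALSE)
GT1 <- unlist(temp1)
GT1[GT1 == "NA"] <- "2"             # send missing genotypes to the third (ZERO) column
GT1 <- as.numeric(GT1) + 1          # 0/1/2 -> column 1/2/3
# seq_along(GT1) replaces the hard-coded 1:8, which only fits this example
paste(refalt[cbind(seq_along(GT1), GT1)], collapse = " ")

It is incomplete, because it still has to be wrapped in a function that can be passed to apply() or lapply(), and the variable name has to be captured at the start of each line.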
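
A minimal sketch of that wrapping step, under the same assumptions as the snippet above (genotype indices 0 and 1 only, missing values as "NA"). The function name gt_line and its arguments are hypothetical, and a character matrix built with cbind() stands in for the data frame.

```r
# Example matrix from the question
mymat <- structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A",
"1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA",
"NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0",
"NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp",
"chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp",
"chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp"
), c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD"
)))

# Hypothetical wrapper: apply the matrix-indexing lookup to one .GT
# column and return a single line that starts with the column name.
gt_line <- function(col, m) {
  alleles <- strsplit(m[, col], "/")
  reps <- lengths(alleles)            # 2 for a called genotype, 1 for "NA"
  # One row per allele; the third column holds the replacement for "NA"
  lookup <- cbind(rep(m[, "REF"], times = reps),
                  rep(m[, "ALT"], times = reps),
                  "0 0")
  idx <- unlist(alleles)
  idx[is.na(idx) | idx == "NA"] <- "2"  # missing -> third column
  idx <- as.integer(idx) + 1L           # 0/1/2 -> column 1/2/3
  paste(col, paste(lookup[cbind(seq_along(idx), idx)], collapse = " "))
}

gt_cols <- grep("\\.GT$", colnames(mymat), value = TRUE)
lines <- vapply(gt_cols, gt_line, character(1), m = mymat)
writeLines(lines)
```

Whether this scales to a few hundred thousand rows has not been measured here; the per-column work is vectorised, so the loop runs only once per .GT column.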

Source

2015-09-21 02:09:34 atiretoo
