2016-08-03

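How to extend a data matrix with corresponding column names

I have a data matrix called mymat. It has .GT columns for samples 00860 and 00861. I want to extend this matrix with new .AD columns. The corresponding .AD column for each sample will have the value 50,0 if .GT is 0/0, 25/25 if .GT is 0/1, and 0,50 if .GT is 1/1. I would also like to add another column, .DP, next to each of them, holding 50 in every row, to get the result shown below. How can I do this kind of conditional expansion of a matrix in R?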

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), .Dimnames = list(
c("chr1:1163804", "chr1:1888193" 
), c("00860.GT", "00861.GT"))) 

Result:

             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804      0/1    25/25       50      0/0     50,0       50
chr1:1888193      1/1     0/50       50      0/0     50,0       50
How many samples (columns) and chr1 positions (rows) do you actually need to handle? – aichao

@aichao It is around 2000 sample columns and more than 100000 rows. – MAPK

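Answers

Here is a data.table solution, with a comment on each line. It is written to handle any number of columns in the mymat object. Briefly:

1) First, we convert to data.table format so that we can handle an arbitrary number of columns, assuming they are all formatted similarly.

2) We find all ".GT" columns and extract the sample ID that precedes ".GT".

3) We create a ".DP" column for each ".GT" column found.

4) We build the "GT" to "AD" mapping as a vector that holds the "to" part of the mapping. The "from" part is stored as the names of the vector.

5) Using the .SDcols feature of data.table, we apply the "GT" to "AD" mapping and create the "AD" columns.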

# Your matrix 
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), 
        .Dimnames = list(c("chr1:1163804", "chr1:1888193"), 
        c("00860.GT", "00861.GT"))) 

# Using a data table approach 
library(data.table) 

# Casting to data table - row.names will be converted to a column called 'rn'. 
mymat = as.data.table(mymat, keep.rownames = TRUE) 

# Find ".GT" columns (the dot is escaped so it matches literally) 
GTcols = grep("\\.GT$", colnames(mymat)) 

# Get the sample ID before ".GT" 
selectedCols = sub("\\.GT$", "", colnames(mymat)[GTcols]) 

selectedCols 
## [1] "00860" "00861" 

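# Create ".DP" columns. Wrapping the left-hand side of := in parentheses 
# makes data.table use the values of the character vector as column names. 
mymat[, (paste0(selectedCols, ".DP")) := 50] 

mymat 
##              rn 00860.GT 00861.GT 00860.DP 00861.DP 
## 1: chr1:1163804      0/1      0/0       50       50 
## 2: chr1:1888193      1/1      0/0       50       50 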

# Create "GT" to "AD" mapping 
GTToADMapping = c("50,0", "25/25", "0/50") 
names(GTToADMapping) = c("0/0", "0/1", "1/1") 

GTToADMapping 
##     0/0     0/1     1/1 
##  "50,0" "25/25"  "0/50" 

# This function will return the "AD" mapping given the values of "GT" 
# (unname() drops the "GT" names so that only the "AD" values are stored) 
mapGTToAD <- function(x){ 
    return (unname(GTToADMapping[x])) 
} 

# Here, we create the AD columns using the GT mapping 
mymat[, (paste0(selectedCols, ".AD")) := lapply(.SD, mapGTToAD), 
      .SDcols = colnames(mymat)[GTcols]] 

mymat 
##              rn 00860.GT 00861.GT 00860.DP 00861.DP 00860.AD 00861.AD 
## 1: chr1:1163804      0/1      0/0       50       50    25/25     50,0 
## 2: chr1:1888193      1/1      0/0       50       50     0/50     50,0 

# We can now order the columns as you have them 
colOrder = as.vector(rbind(paste0(selectedCols, ".GT"), 
                           paste0(selectedCols, ".AD"), 
                           paste0(selectedCols, ".DP"))) 
mymat = mymat[, c("rn", colOrder), with = FALSE] 

mymat 
##              rn 00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP 
## 1: chr1:1163804      0/1    25/25       50      0/0     50,0       50 
## 2: chr1:1888193      1/1     0/50       50      0/0     50,0       50 

# Put it back in the format you had 
mymat2 = as.matrix(mymat[, -1, with = FALSE]) 
rownames(mymat2) = mymat$rn 

mymat2 
##              00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP 
## chr1:1163804 "0/1"    "25/25"  "50"     "0/0"    "50,0"   "50" 
## chr1:1888193 "1/1"    "0/50"   "50"     "0/0"    "50,0"   "50" 
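Thanks. This is great. – MAPK

What is this warning? Running mymat[, paste0(selectedCols, ".DP") := 50, with = F] gives: Warning message: In `[.data.table`(mymat, , `:=`(paste0(selectedCols, ".DP"), 50), : truelength (3141) is greater than 1000 items over-allocated (length = 1055). See ?truelength. If you didn't set the datatable.alloccol option to be very large, please report this to datatable-help including the result of sessionInfo(). – MAPK

This warning is new to me. Please try the solution given here (using the 'alloc.col' function): http://stackoverflow.com/questions/29615181/r-warning-when-creating-a-long-list-of-dummies and let us know whether the warning still occurs. – jav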

There may be a better way, but one way to do this using dplyr is:

library(dplyr) 

set.AD <- function(x) {             ## 1. 
    if (x=="0/0") { 
    return("50/0") 
    } else if (x=="0/1") { 
    return("25/25") 
    } else { 
    return("0/50") 
    } 
} 
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)      ## 2. 
rnames <- rownames(mymat) 
# each %>% must end a line; a line that starts with %>% is a syntax error 
out <- mymat %>% group_by(ID) %>%          ## 3. 
      mutate(`X00860.AD`=set.AD(`X00860.GT`), `X00860.DP`=50, 
             `X00861.AD`=set.AD(`X00861.GT`), `X00861.DP`=50) %>% 
      ungroup() 
out <- data.frame(out[,-1])            ## 4. 
rownames(out) <- rnames 

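Notes:

  1. Define a function that sets the AD column from the GT column according to your logic.
  2. Convert your data to a data frame and add a unique identifier column, so that group_by can be used to apply the function to each row. The row names are also saved.
  3. Use mutate to create the AD and DP columns from the X00860.GT and X00861.GT columns. Note that converting to a data frame prepends X to the column names, because R does not like column names that start with a digit. See this SO answer for an explanation. At this point the result is a tibble, hence the next step.
  4. Remove the ID column, convert to a data frame, and add back the row names.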
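
The X prefix mentioned in note 3 can be seen in isolation with a small sketch. The one-column matrix m and the variable names below are made up for illustration only; data.frame(), its check.names argument and make.names() are base R.

```r
# Illustrative one-column matrix whose column name starts with a digit
m <- matrix(c("0/1", "1/1"), nrow = 2, dimnames = list(NULL, "00860.GT"))

# By default data.frame() makes the names syntactically valid
names_checked <- names(data.frame(m))
# check.names = FALSE keeps the original names
names_kept <- names(data.frame(m, check.names = FALSE))

print(names_checked)            # "X00860.GT"
print(names_kept)               # "00860.GT"
print(make.names("00860.GT"))   # "X00860.GT", the rule data.frame() applies
```

Keeping the original names with check.names = FALSE avoids having to strip the X afterwards, at the cost of needing backticks whenever such a column is referred to by name.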

The result for your data is:

print(out) 
##    X00860.GT X00861.GT X00860.AD X00860.DP X00861.AD X00861.DP 
##chr1:1163804  0/1  0/0  25/25  50  50/0  50 
##chr1:1888193  1/1  0/0  0/50  50  50/0  50 

To reorder the columns to match your output, you can simply do:

out <- out[,c(1,3,4,2,5,6)] 
##    X00860.GT X00860.AD X00860.DP X00861.GT X00861.AD X00861.DP 
##chr1:1163804  0/1  25/25  50  0/0  50/0  50 
##chr1:1888193  1/1  0/50  50  0/0  50/0  50 

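Obviously, this approach only handles your two columns, but it can handle any number of rows.


Edit: handling an arbitrary number of columns (samples)

The notes are given as comments in the code: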

# start from the original mymat matrix; keep its column and row names to use later 
cnames <- colnames(mymat) 
rnames <- rownames(mymat) 
# since DP columns are always 50, we just create a data frame filled with 50 
# to bind to the result as additional columns 
dp <- data.frame(matrix(50, nrow=nrow(mymat), ncol=ncol(mymat))) 
# name the DP columns after the GT columns of mymat 
colnames(dp) <- sub("GT$", "DP", cnames) 
# convert to data frame and augment with ID as before; check.names=FALSE 
# keeps the column names free of the 'X' prefix 
mymat <- data.frame(ID=seq_len(nrow(mymat)), mymat, check.names=FALSE) 
# the difference here is that we use mutate_all to apply set.AD to each 
# (and all) non-grouping column of the input. Each %>% must end a line. 
ad <- mymat %>% group_by(ID) %>% 
      mutate_all(set.AD) %>% 
      ungroup() %>% select(-ID) 
ad <- as.data.frame(ad) 
# name the AD columns after the GT columns 
colnames(ad) <- sub("GT$", "AD", cnames) 
# bind the original GT columns, the AD columns and the DP columns 
out <- cbind(mymat[,-1,drop=FALSE], ad, dp) 
# At this point, we have the original mymat columns followed by the 
# AD columns followed by the DP columns. The following uses a matrix 
# transpose trick to reorder the columns to what you want 
col.order <- as.vector(t(matrix(seq_len(ncol(out)), nrow=length(cnames), ncol=3))) 
out <- out[,col.order] 
# finally, add back the row names 
rownames(out) <- rnames 
print(out) 
##             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP 
##chr1:1163804      0/1    25/25       50      0/0     50/0       50 
##chr1:1888193      1/1     0/50       50      0/0     50/0       50 

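Hope this helps.

Thanks. This is useful, but what if there are more than 2000 samples? – MAPK

You mean columns. No, this won't work for that. I can look at whether the answer can be edited if you need it. – aichao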
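
For the scale mentioned in the comments (about 2000 sample columns and more than 100000 rows), one possible sketch stays in base R and works on the character matrix directly, so nothing is processed row by row. The helper name expand_gt and its arguments ad_map and dp_value are hypothetical, not taken from either answer. The AD strings follow the mapping used in the data.table answer, and the column interleaving reuses the transpose trick from the dplyr answer.

```r
# Hypothetical helper (not from the answers above): expands every .GT column
# of a character matrix into .GT/.AD/.DP triples using a named lookup vector.
expand_gt <- function(gt,
                      ad_map = c("0/0" = "50,0", "0/1" = "25/25", "1/1" = "0/50"),
                      dp_value = "50") {
  samples <- sub("\\.GT$", "", colnames(gt))
  # Look up all cells at once; genotypes missing from ad_map become NA
  ad <- matrix(unname(ad_map[as.vector(gt)]), nrow = nrow(gt),
               dimnames = list(rownames(gt), paste0(samples, ".AD")))
  # Constant DP matrix, stored as character to match the rest of the matrix
  dp <- matrix(dp_value, nrow = nrow(gt), ncol = ncol(gt),
               dimnames = list(rownames(gt), paste0(samples, ".DP")))
  out <- cbind(gt, ad, dp)
  # Interleave columns: GT, AD, DP for sample 1, then sample 2, and so on
  ord <- as.vector(t(matrix(seq_len(ncol(out)), nrow = ncol(gt), ncol = 3)))
  out[, ord, drop = FALSE]
}

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"),
                                    c("00860.GT", "00861.GT")))
res <- expand_gt(mymat)
print(colnames(res))
```

The result is a character matrix like mymat2 in the data.table answer. Whether this is fast enough at the stated size has not been measured here.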