How to expand a data matrix with corresponding column names

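I have a data matrix called mymat. It has .GT columns for samples 00860 and 00861. I want to expand this matrix with new .AD columns. For each sample, the corresponding .AD column should hold 50,0 if .GT is 0/0, 25/25 if .GT is 0/1, and 0,50 if .GT is 1/1. I also want to add a .DP column next to each of them, filled with 50 in every row, to get the result shown below. How can I do this kind of conditional expansion of a matrix in R?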

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), .Dimnames = list(
c("chr1:1163804", "chr1:1888193" 
), c("00860.GT", "00861.GT")))

Result:

             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 0/1      25/25    50       0/0      50,0     50
chr1:1888193 1/1      0/50     50       0/0      50,0     50

Source

2016-08-03 MAPK

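How many samples (columns) and chr1 positions (rows) do you actually need to process? – aichao

@aichao It is something like 2000 sample columns and more than 100000 rows. – MAPK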

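Here is a data.table solution, with a comment on each line. It is written to handle any number of columns in the mymat object. A brief explanation:

1) First, we convert to data.table format, so that we can handle any number of columns, assuming they are formatted the same way.

2) We find all the ".GT" columns and extract the sample ID that precedes ".GT".

3) We create a ".DP" column for each ".GT" column found.

4) We build the "GT" to "AD" mapping as a vector holding the "to" part of the mapping. The "from" part is stored as the names of that vector.

5) We use the .SDcols feature of data.table to apply the "GT" to "AD" mapping and create the "AD" columns.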

# Your matrix 
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), 
        .Dimnames = list(c("chr1:1163804", "chr1:1888193"), 
        c("00860.GT", "00861.GT"))) 

# Using a data table approach 
library(data.table) 

# Convert to a data.table - row names are kept in a column called 'rn'.
mymat = as.data.table(mymat, keep.rownames = TRUE)

# Find the ".GT" columns (escaped dot, anchored at the end of the name)
GTcols = grep("\\.GT$", colnames(mymat))

# Get the sample ID before ".GT"
selectedCols = sub("\\.GT$", "", colnames(mymat)[GTcols])

selectedCols
[1] "00860" "00861"

# Create the ".DP" columns (the parentheses make := evaluate the names)
mymat[, (paste0(selectedCols, ".DP")) := 50]

mymat 
      rn 00860.GT 00861.GT 00860.DP 00861.DP 
1: chr1:1163804  0/1  0/0  50  50 
2: chr1:1888193  1/1  0/0  50  50 

# Create "GT" to "AD" mapping 
GTToADMapping = c("50,0", "25/25", "0/50") 
names(GTToADMapping) = c("0/0", "0/1", "1/1") 

GTToADMapping 
0/0  0/1  1/1 
"50,0" "25/25" "0/50" 

# This function will return the "AD" mapping given the values of "GT" 
mapGTToAD <- function(x){ 
    return (GTToADMapping[x]) 
} 

# Here, we create the AD columns using the GT mapping
mymat[, (paste0(selectedCols, ".AD")) := lapply(.SD, mapGTToAD),
     .SDcols = colnames(mymat)[GTcols]]

      rn 00860.GT 00861.GT 00860.DP 00861.DP 00860.AD 00861.AD 
1: chr1:1163804  0/1  0/0  50  50 25/25  50,0 
2: chr1:1888193  1/1  0/0  50  50  0/50  50,0 

# We can now reorder the columns the way you have them
colOrder = as.vector(rbind(paste0(selectedCols, ".GT"),
        paste0(selectedCols, ".AD"),
        paste0(selectedCols, ".DP")))
mymat = mymat[, c("rn", colOrder), with = FALSE]

mymat 
      rn 00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP 
1: chr1:1163804  0/1 25/25  50  0/0  50,0  50 
2: chr1:1888193  1/1  0/50  50  0/0  50,0  50 

# Put it back in the format you had
mymat2 = as.matrix(mymat[, -1, with = FALSE])
rownames(mymat2) = mymat$rn

mymat2 
      00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP 
chr1:1163804 "0/1" "25/25" "50"  "0/0" "50,0" "50"  
chr1:1888193 "1/1" "0/50" "50"  "0/0" "50,0" "50"
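
For a matrix of the size mentioned in the comments (about 2000 sample columns and over 100000 rows), the same named-vector lookup can also be applied directly to the character matrix in base R. The following is only a minimal sketch and not part of the answer above: the names gt_to_ad, samples, ad, dp, combined and result are made up for illustration, and it assumes every column of mymat is a .GT column.

```r
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"),
                                    c("00860.GT", "00861.GT")))

# GT -> AD lookup: names are the "from" values, elements are the "to" values
gt_to_ad <- c("0/0" = "50,0", "0/1" = "25/25", "1/1" = "0/50")

# Sample IDs, assuming every column name ends in ".GT"
samples <- sub("\\.GT$", "", colnames(mymat))

# Index the lookup with all cells at once, then restore the matrix shape
ad <- matrix(unname(gt_to_ad[as.vector(mymat)]), nrow = nrow(mymat),
             dimnames = list(rownames(mymat), paste0(samples, ".AD")))

# DP is the constant "50" (character, because the matrix is character)
dp <- matrix("50", nrow = nrow(mymat), ncol = ncol(mymat),
             dimnames = list(rownames(mymat), paste0(samples, ".DP")))

combined <- cbind(mymat, ad, dp)

# Interleave the columns: GT, AD, DP for sample 1, then sample 2, ...
n <- ncol(mymat)
col_order <- as.vector(rbind(seq_len(n), n + seq_len(n), 2 * n + seq_len(n)))
result <- combined[, col_order, drop = FALSE]
result
```

Because everything stays a character matrix, there is no conversion to and from another data structure. Any GT value not present in gt_to_ad would come out as NA.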

Source

2016-08-03 02:27:02 jav

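Thank you. This is great. – MAPK

What is this warning: `mymat[, paste0(selectedCols, ".DP") := 50, with = F]` Warning message: In `[.data.table`(mymat, , `:=`(paste0(selectedCols, ".DP"), 50), : truelength (3141) is greater than 1000 items over-allocated (length = 1055). See ?truelength. If you didn't set the datatable.alloccol option to be very large, please report this to datatable-help including the result of sessionInfo(). – MAPK

This warning is new to me. Please try the solution given here (using the `alloc.col` function): http://stackoverflow.com/questions/29615181/r-warning-when-creating-a-long-list-of-dummies and let us know if the warning still occurs. – jav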

There may be a better way, but one way to do this using dplyr is:

library(dplyr) 

set.AD <- function(x) {             ## 1. 
    if (x=="0/0") { 
    return("50/0") 
    } else if (x=="0/1") { 
    return("25/25") 
    } else { 
    return("0/50") 
    } 
} 
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)      ## 2. 
rnames <- rownames(mymat) 
out = mymat %>% group_by(ID) %>%          ## 3. 
      mutate(`X00860.AD`=set.AD(`X00860.GT`), `X00860.DP`=50, 
         `X00861.AD`=set.AD(`X00861.GT`), `X00861.DP`=50) 
out <- data.frame(out[,-1])            ## 4. 
rownames(out) <- rnames

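Notes:

1. Define a function that sets the AD value from the GT value according to your logic.
2. Convert your data to a data frame and add a unique identifier column, so that group_by can be used to apply the function to each row. Also keep the row names.
3. Use mutate to create the AD and DP columns for the X00860.GT and X00861.GT columns. Note that converting to a data frame prepends an X to the column names, because R does not like column names that start with a digit. See this SO answer for an explanation.

At this point the result is a tibble. Therefore,

4. Remove the ID column, convert to a data frame, and add back the row names.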
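
The X prefix mentioned in the notes above comes from data.frame() making the column names syntactically valid by default. A small base R sketch, assuming the original character matrix from the question (rebuilt here as m, a name chosen for illustration), shows the behaviour and how check.names = FALSE avoids it:

```r
m <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
               .Dimnames = list(c("chr1:1163804", "chr1:1888193"),
                                c("00860.GT", "00861.GT")))

# Default: names are made syntactically valid, so a leading digit gets an "X"
names(data.frame(m))
## [1] "X00860.GT" "X00861.GT"

# check.names = FALSE keeps the original names
names(data.frame(m, check.names = FALSE))
## [1] "00860.GT" "00861.GT"

# make.names() shows the same transformation on a single name
make.names("00860.GT")
## [1] "X00860.GT"
```

Names that start with a digit then have to be quoted with backticks when used inside mutate().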

The result for your data is:

print(out) 
##    X00860.GT X00861.GT X00860.AD X00860.DP X00861.AD X00861.DP 
##chr1:1163804  0/1  0/0  25/25  50  50/0  50 
##chr1:1888193  1/1  0/0  0/50  50  50/0  50

To reorder the columns to match your output, you can simply do:

out <- out[,c(1,3,4,2,5,6)] 
##    X00860.GT X00860.AD X00860.DP X00861.GT X00861.AD X00861.DP 
##chr1:1163804  0/1  25/25  50  0/0  50/0  50 
##chr1:1888193  1/1  0/50  50  0/0  50/0  50

Obviously, this approach only handles your two columns, but it can handle any number of rows.
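
The group_by(ID) step is needed because set.AD uses if, which expects a single value rather than a whole column. A vectorised variant would avoid the row-by-row grouping, which may matter with many rows. The following is a sketch only: set.AD.vec is a hypothetical name, and it uses the same 50/0, 25/25 and 0/50 values as set.AD.

```r
# Hypothetical vectorised counterpart of set.AD (not from the answer above)
set.AD.vec <- function(x) {
  x <- as.character(x)   # also accepts factor columns
  ifelse(x == "0/0", "50/0",
         ifelse(x == "0/1", "25/25", "0/50"))
}

set.AD.vec(c("0/1", "1/1", "0/0"))
## [1] "25/25" "0/50"  "50/0"
```

Like set.AD, this treats every value other than 0/0 and 0/1 as 1/1.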

Edit: handling an arbitrary number of columns (samples)

Notes are given as comments in the code:

# keep column and row names of original mymat to use later 
cnames <- colnames(mymat) 
rnames <- rownames(mymat) 
# since DP columns are always 50, we just create a data frame filled with 50 
# to bind to the result as additional columns 
dp <- data.frame(matrix(rep(50,ncol(mymat)*nrow(mymat)), nrow=nrow(mymat), ncol=ncol(mymat))) 
# set the column name to that of mymat 
colnames(dp) <- cnames 
# convert to data frame and augment with ID as before 
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat) 
# the difference here is that we use mutate_each to apply set.AD to each 
# (and all) column of the input. This is done in-place. We then bind the 
# original mymat and dp as columns to this result 
out <- mymat %>% group_by(ID) %>% 
      mutate_each(funs(set.AD)) %>% 
      ungroup() %>% select(-ID) %>% 
      bind_cols(mymat[,-1],.) %>% bind_cols(dp) 
# At this point, we have the original mymat columns followed by the 
# AD columns followed by the DP columns. The following uses a matrix 
# transpose trick to resort the columns to what you want 
col.order <- as.vector(t(matrix(seq_len(ncol(out)), nrow=ncol(mymat)-1, ncol=3))) 
out <- data.frame(out[,col.order]) 
# finally, use gsub to change the column names for the AD and DP columns, 
# get rid of the 'X' in the column names, and add back the row names 
colnames(out) <- gsub("X", "", gsub("GT.1", "AD", gsub("GT.2", "DP", colnames(out)))) 
rownames(out) <- rnames 
print(out) 
##    00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP 
##chr1:1163804  0/1 25/25  50  0/0  50/0  50 
##chr1:1888193  1/1  0/50  50  0/0  50/0  50
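
The matrix transpose trick used for col.order can be checked in isolation. A small sketch, assuming two samples, so six columns arranged as GT, GT, AD, AD, DP, DP; n.samples and idx are names introduced here for illustration:

```r
n.samples <- 2

# Filled column-wise: row i holds the GT, AD and DP positions of sample i
idx <- matrix(seq_len(3 * n.samples), nrow = n.samples, ncol = 3)
idx
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# Reading the transpose column-wise walks through the samples one by one
as.vector(t(idx))
## [1] 1 3 5 2 4 6
```

Indexing the columns with this vector puts each sample's GT, AD and DP columns next to each other.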

Hope this helps.

Source

2016-08-03 02:03:26 aichao

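Thanks. This is useful, but what if there are more than 2000 samples? – MAPK

You mean columns. No, this will not work for that. If you need it, I can see whether the answer can be edited. – aichao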
