2017-08-14 25 views

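Transposing a data frame with duplicate values

I have a data frame with two columns, one for gene symbols and one for functional pathways. The pathway column contains repeated values because each pathway has many genes. I want to reshape this dataset so that each column is a single pathway and the rows in each column are the genes belonging to that pathway.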

Starting data frame:

data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"), 
gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10")) 

Desired data frame:

data.frame(p1 = c("G1", "G2", "G3", "G4"), p2 = c("G33", "G43", "G10", 
"")) 

I know that not all of the columns will be the same length, and blank values are preferable to NAs.


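Since the columns will not have the same length, you would really be better off creating a standard `list` rather than a `data.frame`, especially since row 1 of column 1 has nothing to do with row 1 of column 2.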
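
A minimal sketch of the list approach suggested in this comment. The variable names `df` and `genes_by_pathway` are assumptions made for illustration, and `stringsAsFactors = FALSE` is set so the genes stay plain character vectors on older R versions.

```r
# `df` is an assumed name for the question's data frame
df <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
                 gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"),
                 stringsAsFactors = FALSE)

# One character vector of genes per pathway; the vectors may differ in length
genes_by_pathway <- split(df$gene.symbol, df$pathway)

# Access the genes of a single pathway by name
genes_by_pathway$p2

# Number of genes in each pathway
lengths(genes_by_pathway)
```

A list like this needs no padding at all, which may be simpler if the genes are only ever looked up per pathway.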

Answers

0

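Here is another option.

  1. Split into a list, using the pathway as the splitting factor
  2. Get the maximum length across the groups and set every other group to that length
  3. Turn it back into a data frame

Here is the code.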

mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
                   gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"),
                   stringsAsFactors = FALSE)

# function to run over each element in the list:
# extends x to length n, padding the end with NA
set_to_max_length <- function(x, n) {
    length(x) <- n
    return(x)
}

# 1. split into a list of gene vectors, one per pathway
mydf.split <- split(mydf$gene.symbol, mydf$pathway)

# 2.a get max length of all columns
max.length <- max(lengths(mydf.split))

# 2.b set each list element to max length
mydf.split.2 <- lapply(mydf.split, set_to_max_length, n = max.length)

# 3. combine back into df (shorter columns end in NA)
data.frame(mydf.split.2)
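
The code above leaves `NA` in the shorter columns, whereas the question asks for blanks. One possible variation, sketched here under the assumption that the genes are character strings, pads each vector with `""` instead. The names `padded` and `result` are introduced only for this illustration.

```r
mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
                   gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"),
                   stringsAsFactors = FALSE)

# Split into one character vector per pathway
mydf.split <- split(mydf$gene.symbol, mydf$pathway)
max.length <- max(lengths(mydf.split))

# Pad each vector with empty strings up to max.length
padded <- lapply(mydf.split, function(x) c(x, rep("", max.length - length(x))))

# Combine into a data frame with one column per pathway
result <- data.frame(padded, stringsAsFactors = FALSE)
result
```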

Edit

Here is another option using the tidyverse, which is somewhat more concise:

library(tidyverse) 
mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"), 
        gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10")) 

mydf %>% 
    group_by(pathway) %>% 
    mutate(rownum = row_number()) %>% 
    ungroup() %>% 
    spread(pathway, gene.symbol) %>% 
    select(-1) 
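
In newer tidyr releases `spread()` is superseded by `pivot_wider()`. The following is a sketch of the same pipeline using it, assuming dplyr and a tidyr version that provides `pivot_wider()` (1.0.0 or later) are installed. The `values_fill` argument is used here to produce blanks rather than `NA`, and the name `wide` is introduced only for this illustration.

```r
library(dplyr)
library(tidyr)

mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
                   gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"),
                   stringsAsFactors = FALSE)

wide <- mydf %>%
    group_by(pathway) %>%
    mutate(rownum = row_number()) %>%   # position of each gene within its pathway
    ungroup() %>%
    pivot_wider(names_from = pathway,
                values_from = gene.symbol,
                values_fill = list(gene.symbol = "")) %>%  # blank instead of NA
    select(-rownum)

wide
```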
0

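This may seem a little convoluted, but it goes to a list first and then back to a data.frame to reach the desired output:

# assumes df holds the data frame from the question
df$gene.symbol <- as.character(df$gene.symbol) 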

pw_list <- list() 
for (pw in unique(df$pathway)) { 
    pw_list[[pw]] <- df[df$pathway == pw, "gene.symbol"] 
} 
pw_list 
$p1 
[1] "G1" "G2" "G3" "G4" 

$p2 
[1] "G33" "G43" "G10" 


reordered <- matrix("", nrow = max(sapply(pw_list, length)), ncol = length(pw_list)) 
colnames(reordered) <- names(pw_list) 

for (pw in names(pw_list)){ 
    n <- length(pw_list[[pw]]) 
    reordered[1:n, pw] <- pw_list[[pw]] 
} 
reordered <- as.data.frame(reordered) 
reordered 
  p1  p2
1 G1 G33
2 G2 G43
3 G3 G10
4 G4    

Edit

A slightly more concise version:

df$gene.symbol <- as.character(df$gene.symbol) 
pw_list <- list() 
for (pw in unique(df$pathway)) { 
    pw_list[[pw]] <- df[df$pathway == pw, "gene.symbol"] 
} 
reordered <- as.data.frame(sapply(pw_list, "[", i = 1:max(sapply(pw_list, length))), 
          stringsAsFactors = FALSE) 
reordered[is.na(reordered)] <- "" 
names(reordered) <- names(pw_list)