2016-05-03

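Unique combinations, based on criteria, from one row

I have a data.frame with more than 200 columns, including the following subset of columns relevant to this question: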

>df 
Variant Pos  ID DB.0.count DB.1.count sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 
variant5 1234567 A  5    5    1/0  1/0  1/0  1/1  1/1  0/0  1/0  0/0  1/0  1/1 
.   .  .  .    .    F1   F1   F1   F2   F2   F3   F4   F4   F4   F5 

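I would like to:

1. Make all possible combinations of the sample1-sample10 columns, where each combination contains one sample from each F number, i.e. each combination contains 5 samples, one each from F1, F2, F3, F4 and F5.

So in the example above there would be 18 combinations (3 x 2 x 1 x 3 x 1), for example:

The first combination would be sample1, sample4, sample6, sample7, sample10

The second combination would be sample1, sample4, sample6, sample8, sample10

The third combination would be sample1, sample4, sample6, sample9, sample10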

I have played around with unique, duplicated and 0123 after reading related posts, but got nowhere.

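Then I would like to output each unique combination to a new data.frame, count the genotypes for each variant across the samples in that combination and write the results to new columns, then run the Fisher's exact test below and write the p-value to a new column as well. The code below should do this (I learned the Fisher code here: Fisher's exact test on values from large dataframe and bypassing errors):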

# column names containing "/" must be wrapped in backticks 
df.combo.1$`pop.0/0.count` <- apply(df.combo.1[,6:10], 1, function(u) sum(grepl("0/0",u))) 
df.combo.1$`pop.1/0.count` <- apply(df.combo.1[,6:10], 1, function(u) sum(grepl("1/0",u))) 
df.combo.1$`pop.1/1.count` <- apply(df.combo.1[,6:10], 1, function(u) sum(grepl("1/1",u))) 

# allele counts: two per homozygote plus one per heterozygote (coded "1/0" in this data, not "0/1") 
df.combo.1$pop.0.count <- 2*df.combo.1$`pop.0/0.count` + df.combo.1$`pop.1/0.count` 
df.combo.1$pop.1.count <- 2*df.combo.1$`pop.1/1.count` + df.combo.1$`pop.1/0.count` 

res <- NULL 
for (i in 1:nrow(df.combo.1)){ 
table <- matrix(c(df.combo.1[i, 4], df.combo.1[i, 5], df.combo.1[i, 14], df.combo.1[i, 15]), ncol = 2, byrow = TRUE) 
# if any NA occurs in your table save an error in p else run the fisher test 
if(any(is.na(table))) p <- "error" else p <- fisher.test(table)$p.value 
# save all p values in a vector 
res <- c(res,p) 
} 
df.combo.1$fishers <- res 


>df.combo.1 
Variant Pos  ID DB.0.count DB.1.count sample1 sample4 sample6 sample7 sample10 pop.0/0.count pop.1/0.count pop.1/1.count pop.0.count pop.1.count  fishers 
variant5 1234567 A  5    5    1/0  1/1  0/0  1/0  1/1  1    2    2    4    6    1.0000 
.   .  .  .    .    F1   F2   F3   F4   F5 

2. Finally, I would like to create a data.frame that lists the Fisher's exact p-value for each unique combination, as follows:

>new.df 
combo fishers 
1  1.0000 
2  1.0000 
3  1.0000 
4  1.0000 
etc 

I assume this whole exercise might need some kind of for loop?

Answers


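I think I've got a handle on what you want. For the bit I think you were struggling with in part 1, I used a combination of which and expand.grid to sort it out.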
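As a rough sketch of the idea on its own (separate from the full code further down, and using split rather than which to group the sample names), assuming the family labels are available as a named vector; the names fam, by.family and combos are made up for illustration:

```r
# Family label for each sample, copied from the example row (fam is a hypothetical name)
fam <- c(sample1 = "F1", sample2 = "F1", sample3 = "F1",
         sample4 = "F2", sample5 = "F2", sample6 = "F3",
         sample7 = "F4", sample8 = "F4", sample9 = "F4",
         sample10 = "F5")

# group the sample names by family: a list with elements F1 ... F5
by.family <- split(names(fam), fam)

# one row per combination, each row holding exactly one sample per family
combos <- expand.grid(by.family, stringsAsFactors = FALSE)

nrow(combos)   # 3 * 2 * 1 * 3 * 1 = 18
combos[1, ]    # sample1, sample4, sample6, sample7, sample10
```

expand.grid varies the first family fastest, so the rows will not come out in the same order as the combinations listed in the question, but the same 18 sets are all there.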

For part 2, once the data is laid out with one row per observation, that part is a fairly easy grouping.

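It looks like you are using 2 rows per observation (unless that is just a formatting thing), which makes this quite difficult (not impossible, it just needs more juggling), so I have combined the data into one row. This should be a fairly simple conversion: append the corresponding columns from each "second" row to each "first" row, then delete every "second" row.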
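A minimal sketch of that conversion, assuming the real data really does alternate a genotype row with a family-label row; the raw data frame and the ".family" suffix are invented here for illustration and only three sample columns are shown:

```r
# Hypothetical two-row layout: row 1 holds genotypes, row 2 holds family labels
raw <- data.frame(Variant = c("variant5", "."),
                  sample1 = c("1/0", "F1"),
                  sample2 = c("1/0", "F1"),
                  sample4 = c("1/1", "F2"),
                  stringsAsFactors = FALSE)

sample.cols <- grep("^sample", names(raw), value = TRUE)

# odd rows are the "first" rows, even rows are the "second" rows
first  <- raw[seq(1, nrow(raw), by = 2), , drop = FALSE]
second <- raw[seq(2, nrow(raw), by = 2), sample.cols, drop = FALSE]
names(second) <- paste0(sample.cols, ".family")

# append the family columns to the genotype row; the "second" rows are dropped
wide <- cbind(first, second)
rownames(wide) <- NULL
```

With more than one variant in raw, the same indexing pairs each odd row with the even row that follows it.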

This could be done more efficiently and tidily, but I think it works and it should extend fairly easily to other cases.
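One tidier direction, purely as a sketch: count alleles by splitting the genotype strings instead of pattern matching, and wrap the test in a small helper. The helper name combo.fisher and its arguments are hypothetical, and the DB counts of 5 and 5 are taken from the example row:

```r
# Genotypes from the example row, keyed by sample name
geno <- c(sample1 = "1/0", sample2 = "1/0", sample3 = "1/0",
          sample4 = "1/1", sample5 = "1/1", sample6 = "0/0",
          sample7 = "1/0", sample8 = "0/0", sample9 = "1/0",
          sample10 = "1/1")

# Hypothetical helper: Fisher p-value for one set of samples against the DB counts
combo.fisher <- function(samples, db0 = 5, db1 = 5) {
    # "1/0" becomes c("1", "0"), so the genotype order does not matter
    alleles <- unlist(strsplit(geno[samples], "/", fixed = TRUE))
    pop0 <- sum(alleles == "0")
    pop1 <- sum(alleles == "1")
    tab <- matrix(c(db0, db1, pop0, pop1), ncol = 2, byrow = TRUE)
    fisher.test(tab)$p.value
}

# first combination from the question: 4 "0" alleles and 6 "1" alleles
p <- combo.fisher(c("sample1", "sample4", "sample6", "sample7", "sample10"))
```

A helper like this could then be called once per combination, for example with apply over the rows of a table of sample names.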

Regards, Josh

# provided demo data 
# Variant Pos  ID DB.0.count DB.1.count sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 
# variant5 1234567 A  5    5    1/0  1/0  1/0  1/1  1/1  0/0  1/0  0/0  1/0  1/1 
# .   .  .  .    .    F1   F1   F1   F2   F2   F3   F4   F4   F4   F5 


# create data frame in wide format (one row holding the results followed by the "F" codes) 
test.df <- as.data.frame(t(c("variant5",1234567,"A",5,5,"1/0","1/0","1/0","1/1","1/1","0/0","1/0","0/0","1/0","1/1","F1", "F1", "F1", "F2", "F2", "F3", "F4", "F4", "F4", "F5"))) 
# ensure as character format 
test.df[] <- lapply(test.df, as.character) 
# get positions of "F" data 
F1.var <- which(test.df =="F1") 
F2.var <- which(test.df =="F2") 
F3.var <- which(test.df =="F3") 
F4.var <- which(test.df =="F4") 
F5.var <- which(test.df =="F5") 
# get all combinations of the 5 F positions 
Fcode.combinations <- expand.grid(F1.var,F2.var,F3.var,F4.var,F5.var) 
# create results data frame 
df.combo.1 <- as.data.frame(matrix(NA,ncol = 21, nrow = nrow(Fcode.combinations))) 
# name variables 
names(df.combo.1) <- c("Variant","Pos","ID","DB.0.count","DB.1.count", 
           "F1.sample.pos","F1.result", 
           "F2.sample.pos","F2.result", 
           "F3.sample.pos","F3.result", 
           "F4.sample.pos","F4.result", 
           "F5.sample.pos","F5.result", 
           "pop.0_0.count","pop.1_0.count","pop.1_1.count", 
           "pop.0.count","pop.1.count", 
           "fishers") 
# copy in common data 
df.combo.1[,1:5] <- test.df[,1:5] 
# setup variables based on combination data 
for(i in 1:nrow(Fcode.combinations)){ 
    df.combo.1[i,c(6,8,10,12,14)] <- Fcode.combinations[i,] 
    # -10 to correct for the position of the results not the 'F type' data 
    cycle.results <- as.numeric(Fcode.combinations[i,] -10) 
    df.combo.1[i,c(7,9,11,13,15)] <- test.df[cycle.results] 
} 

# this is essentially your code with the column reference changed 

result.cols <- c(7,9,11,13,15) 
df.combo.1$pop.0_0.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(u == "0/0")) 
# heterozygotes are coded "1/0" in this data; "0/1" is accepted too 
df.combo.1$pop.1_0.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(u %in% c("1/0","0/1"))) 
df.combo.1$pop.1_1.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(u == "1/1")) 

# allele counts: two per homozygote plus one per heterozygote 
df.combo.1$pop.0.count <- 2*df.combo.1$pop.0_0.count + df.combo.1$pop.1_0.count 
df.combo.1$pop.1.count <- 2*df.combo.1$pop.1_1.count + df.combo.1$pop.1_0.count 

res <- NULL 
for (i in 1:nrow(df.combo.1)){ 
    # columns 19 and 20 hold pop.0.count and pop.1.count 
    table <- matrix(as.numeric(c(df.combo.1[i, 4], df.combo.1[i, 5], df.combo.1[i, 19], df.combo.1[i, 20])), ncol = 2, byrow = TRUE) 
    # if any NA occurs in your table save an error in p else run the fisher test 
    if(any(is.na(table))) p <- "error" else p <- fisher.test(table)$p.value 
    # save all p values in a vector 
    res <- c(res,p) 
} 
df.combo.1$fishers <- res 

# create results data 
df.combo.1.results <- as.data.frame(cbind(1:nrow(df.combo.1),df.combo.1$fishers)) 
names(df.combo.1.results) <- c("combo","fishers") 
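
One thing to watch: if any p-value is stored as the string "error", the whole vector becomes character, and cbind then turns the combo numbers into text as well. A small sketch of an alternative, assuming a missing value is an acceptable marker; the fishers vector here is made-up illustration data, not real results:

```r
# Hypothetical p-values; NA_real_ marks a combination whose table contained an NA
fishers <- c(1, 1, NA_real_, 1)

# data.frame() keeps each column's own type, unlike cbind() on mixed vectors
new.df <- data.frame(combo = seq_along(fishers), fishers = fishers)

# rows that could not be tested are easy to pick out afterwards
new.df[is.na(new.df$fishers), ]
```

Keeping the column numeric means it can be sorted or filtered on directly later.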

Perfect, thank you so much! – emily