2017-05-04 63 views
3

How to perform an operation between grouped values from two data frames

I have two data frames:


src_tbl <- structure(list(Sample_name = c("S1", "S2", "S1", "S2", "S1", 
"S2"), crt = c(0.079, 0.082, 0.079, 0.082, 0.079, 0.082), sr = c(0.592, 
0.549, 0.592, 0.549, 0.592, 0.549), condition = c("x1", "x1", 
"x2", "x2", "x3", "x3"), score = c("0.077", "0.075", "0.483", 
"0.268", "0.555", "0.120")), row.names = c(NA, -6L), .Names = c("Sample_name", 
"crt", "sr", "condition", "score"), class = c("tbl_df", 
"tbl", "data.frame")) 
src_tbl 
#>   Sample_name   crt    sr condition score
#> 1          S1 0.079 0.592        x1 0.077
#> 2          S2 0.082 0.549        x1 0.075
#> 3          S1 0.079 0.592        x2 0.483
#> 4          S2 0.082 0.549        x2 0.268
#> 5          S1 0.079 0.592        x3 0.555
#> 6          S2 0.082 0.549        x3 0.120

ref_tbl <- structure(list(Sample_name = c("P1", "P2", "P3", "P1", "P2", 
"P3", "P1", "P2", "P3"), crt = c(1, 1, 1, 1, 1, 1, 1, 1, 1), 
    sr = c(2, 2, 2, 2, 2, 2, 2, 2, 2), condition = c("r1", "r1", 
    "r1", "r2", "r2", "r2", "r3", "r3", "r3"), score = c("0.200", 
    "0.201", "0.199", "0.200", "0.202", "0.200", "0.200", "0.204", 
    "0.197")), row.names = c(NA, -9L), .Names = c("Sample_name", 
"crt", "sr", "condition", "score"), class = c("tbl_df", 
"tbl", "data.frame")) 
ref_tbl 
#>   Sample_name crt sr condition score
#> 1          P1   1  2        r1 0.200
#> 2          P2   1  2        r1 0.201
#> 3          P3   1  2        r1 0.199
#> 4          P1   1  2        r2 0.200
#> 5          P2   1  2        r2 0.202
#> 6          P3   1  2        r2 0.200
#> 7          P1   1  2        r3 0.200
#> 8          P2   1  2        r3 0.204
#> 9          P3   1  2        r3 0.197

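What I want to do is perform an operation (ks.test()) on the score column, grouped by Sample_name, across both data frames. For example, the KS test p-value for S1 versus P1: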


# in src_tbl 
s1 <- c(0.077,0.483,0.555) 
#in ref_tbl 
p1 <- c(0.200,0.200,0.200) 
testout <- ks.test(s1,p1) 
#> Warning in ks.test(s1, p1): cannot compute exact p-value with ties 
broom::tidy(testout) 
#>   statistic   p.value                             method alternative
#> 1 0.6666667 0.5175508 Two-sample Kolmogorov-Smirnov test   two-sided
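
To make the reported statistic concrete, here is a minimal sketch in base R of how the two-sided KS statistic can be recomputed by hand as the largest gap between the two empirical CDFs. The names `grid` and `d` are illustrative only and are not part of the original post.

```r
# Scores for S1 (from src_tbl) and P1 (from ref_tbl), as in the example above
s1 <- c(0.077, 0.483, 0.555)
p1 <- c(0.200, 0.200, 0.200)

# Evaluate both empirical CDFs at every observed value
grid <- sort(unique(c(s1, p1)))

# The two-sided KS statistic is the largest absolute difference between them
d <- max(abs(ecdf(s1)(grid) - ecdf(p1)(grid)))
d
```

The value of `d` should agree with the `statistic` column reported by `broom::tidy(testout)`.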

I want to do this for every pair, so that in the end we get a table like this:

src ref p.value 
S1 P1 0.5175508 
S1 P2 0.6 
S1 P3 0.6 
S2 P1 0.5175508 
S2 P2 0.6 
S2 P3 0.6 

How can I do that? Since the number of samples in ref_tbl can be large (P1, P2, ..., P10k), a fast approach is preferred.

+0

Are the two data frames of different lengths? –

+0

@J.con `src_tbl` and `ref_tbl` can have the same or different dimensions. – pdubois

Answers

3

Here is a solution in the tidyverse. I first nest the scores in each source dataset:

library(dplyr) # provides %>%, mutate(), select(), left_join() and bind_cols() used below

ref_tbl <- ref_tbl %>% 
    mutate(ref = as.factor(Sample_name), 
     score_ref = as.numeric(score)) %>% 
    select(ref, score_ref) %>% 
    tidyr::nest(score_ref) 

ref_tbl 
# A tibble: 3 x 2 
    ref     data 
    <fctr>     <list> 
1  P1 <tibble [3 x 1]> 
2  P2 <tibble [3 x 1]> 
3  P3 <tibble [3 x 1]> 

src_tbl <- src_tbl %>% 
    mutate(src = as.factor(Sample_name), 
     score_src = as.numeric(score)) %>% 
    select(src, score_src) %>% 
    tidyr::nest(score_src) 

src_tbl 
# A tibble: 2 x 2 
    src     data 
    <fctr>     <list> 
1  S1 <tibble [3 x 1]> 
2  S2 <tibble [3 x 1]> 

Then I create a grid with all combinations of sample names:

all_comb <- as_data_frame(expand.grid(src = src_tbl$src, ref = ref_tbl$ref)) 

all_comb 
# A tibble: 6 x 2 
    src ref 
    <fctr> <fctr> 
1  S1  P1 
2  S2  P1 
3  S1  P2 
4  S2  P2 
5  S1  P3 
6  S2  P3 

Now we can join the nested data, and I bind the columns so that each combination has a single list column containing both sets of scores.

all_comb <- all_comb %>% 
    left_join(ref_tbl, by = "ref") %>% 
    left_join(src_tbl, by = "src") %>% 
    mutate(data = purrr::map2(data.x, data.y, bind_cols)) %>% 
    select(-data.x, -data.y) 

all_comb 
# A tibble: 6 x 3 
    src ref     data 
    <fctr> <fctr>     <list> 
1  S1  P1 <tibble [3 x 2]> 
2  S2  P1 <tibble [3 x 2]> 
3  S1  P2 <tibble [3 x 2]> 
4  S2  P2 <tibble [3 x 2]> 
5  S1  P3 <tibble [3 x 2]> 
6  S2  P3 <tibble [3 x 2]> 

Finally, I map ks.test over each nested dataset and use broom to extract the p.value as requested.

final <- all_comb %>% 
    mutate(ks = purrr::map(data, ~ks.test(.$score_ref, .$score_src)), 
    tidied = purrr::map(ks, broom::tidy)) %>% 
    tidyr::unnest(tidied) %>% 
    select(src, ref, p.value) 
Warning message: cannot compute exact p-value with ties 
Warning message: cannot compute exact p-value with ties 

final 
# A tibble: 6 x 3 
    src ref p.value 
    <fctr> <fctr>  <dbl> 
1  S1  P1 0.5175508 
2  S2  P1 0.5175508 
3  S1  P2 0.6000000 
4  S2  P2 0.6000000 
5  S1  P3 0.6000000 
6  S2  P3 0.6000000 
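
As a rough alternative, the same pairwise table can be sketched in base R only, using `split()` and `mapply()`. This is not the method of the answer above, just an illustration under stated assumptions: the data frames `src` and `ref` are reduced copies of the question's tables (names and scores only), and the helper names `src_scores`, `ref_scores` and `pairs` are hypothetical. Because each sample's scores stay in their own vector, the groups do not need to have the same number of rows.

```r
# Reduced copies of the question's tables (sample names and scores only)
src <- data.frame(
  Sample_name = rep(c("S1", "S2"), times = 3),
  score = c("0.077", "0.075", "0.483", "0.268", "0.555", "0.120"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  Sample_name = rep(c("P1", "P2", "P3"), times = 3),
  score = c("0.200", "0.201", "0.199", "0.200", "0.202", "0.200",
            "0.200", "0.204", "0.197"),
  stringsAsFactors = FALSE
)

# One numeric score vector per sample name
src_scores <- split(as.numeric(src$score), src$Sample_name)
ref_scores <- split(as.numeric(ref$score), ref$Sample_name)

# All src/ref name combinations
pairs <- expand.grid(src = names(src_scores), ref = names(ref_scores),
                     stringsAsFactors = FALSE)

# Run ks.test() on each pair; warnings about ties are suppressed here,
# which is a choice made for this sketch only
pairs$p.value <- mapply(
  function(s, r) {
    suppressWarnings(ks.test(src_scores[[s]], ref_scores[[r]])$p.value)
  },
  pairs$src, pairs$ref,
  USE.NAMES = FALSE
)
pairs
```

Whether this is faster than the nested approach for thousands of reference samples has not been measured here. Note also that p-values for pairs with tied scores may depend on the R version, since the handling of ties in `ks.test()` has changed over time.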
+0

Thanks, your solution means a lot to me. However, there may be a bug in your approach. Please see this: http://stackoverflow.com/questions/43806347/combination-of-purrrmap-and-dplyr-give-inconsistent-result-with-a-statistical – pdubois

+1

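I think this approach is fine, but you have to be careful about the order of the arguments in ks.test. For large amounts of data, a `data.table` approach might be faster. – FlorianGD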
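
A small sketch of when the argument order matters, using made-up vectors `x` and `y` that are not from the question: for the default two-sided test the statistic is the same either way, but with a one-sided `alternative` swapping the arguments changes which direction is tested.

```r
# Illustrative data without ties (not taken from the question)
x <- c(0.11, 0.35, 0.42, 0.58, 0.73)
y <- c(0.05, 0.27, 0.64, 0.81, 0.92, 0.99)

# Two-sided: the statistic does not depend on the argument order
two_sided_xy <- ks.test(x, y)
two_sided_yx <- ks.test(y, x)

# One-sided: swapping the arguments flips the direction being tested,
# so "less" on (x, y) corresponds to "greater" on (y, x)
less_xy    <- ks.test(x, y, alternative = "less")
greater_yx <- ks.test(y, x, alternative = "greater")
less_yx    <- ks.test(y, x, alternative = "less")

c(two_sided_xy = unname(two_sided_xy$statistic),
  two_sided_yx = unname(two_sided_yx$statistic),
  less_xy      = unname(less_xy$statistic),
  greater_yx   = unname(greater_yx$statistic),
  less_yx      = unname(less_yx$statistic))
```

So with the default `alternative = "two.sided"` used in the answer, the order of `score_ref` and `score_src` should not change the statistic, but it would matter as soon as a one-sided alternative is requested.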

0

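Well, it took a while, but I pieced together a hacky solution. I'm sure there is a more elegant way, for example with ddply, but that is beyond me. (Note that my p-values differ somewhat from yours because I shortened one of the data frames.)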

library(dplyr) 
library(tidyr) 
ref_tbl <- ref_tbl[1:6, ] # trim to six rows so both tables have equal length for this example

# concatenate sample names with their scores; stringsAsFactors = TRUE keeps
# levels() below working on R >= 4.0, where strings are no longer factors by default
dd <- as.data.frame(cbind(paste(src_tbl$Sample_name, '-', src_tbl$score), 
                          paste(ref_tbl$Sample_name, '-', ref_tbl$score)), 
                    stringsAsFactors = TRUE)

ex <- expand.grid(x = levels(dd$V1), y = levels(dd$V2)) # obtain all combinations

all <- ex %>% 
    separate(x, c("S", "svalue"), "-") %>% 
    separate(y, c("P", "pvalue"), "-") # split names and scores apart again now that we have the combinations

all$svalue <- as.numeric(all$svalue) # change to numeric for ks.test
all$pvalue <- as.numeric(all$pvalue) 

x <- split(all, list(all$S, all$P)) # split into a list of data frames, one per combination

ks <- lapply(x, function(x) ks.test(x[, 2], x[, 4])) # apply ks.test to each combination

pval <- lapply(ks, '[[', 'p.value') # extract p-values

do.call(rbind, pval) # final result at last!

#    [,1] 
#S1 .P1 0.5175508 
#S2 .P1 0.5175508 
#S1 .P2 0.1389203 
#S2 .P2 0.1389203 
#S1 .P3 0.1389203 
#S2 .P3 0.1389203 