2017-06-13 54 views
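Subsetting and row_binding by column value using a tidyverse approach

I have a data.frame that I want to subset (by row) into (overlapping) "batches", and then purrr::map those batches to a function. In the example below, d is the data.frame I want to subset and batch: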

library(tidyverse) 

set.seed(19) 
n1 <- data.frame(c0= "N",c1 = rep("A",4),c2 = rep(c("i","j"),2), num = rnorm(4)) 
n2 <- data.frame(c0= "N", c1 = rep("B",6),c2 = rep(c("i","j"),3), num = rnorm(3)) 
y1 <- data.frame(c0 = "Y", c1 = rep("A",2),c2 = c("i","j"), num = rnorm(2)) 
y2 <- data.frame(c0 = "Y", c1 = rep("B",4),c2 = rep(c("i","j"),each = 2), num = rnorm(2)) 

d <- rbind(y1,y2,n1,n2) 

Here is d:

#    c0 c1 c2        num
# 1   Y  A  i -0.7447795
# 2   Y  A  j -0.2597870
# 3   Y  B  i -0.1830838
# 4   Y  B  i  0.5186300
# 5   Y  B  j -0.1830838
# 6   Y  B  j  0.5186300
# 7   N  A  i -1.1894537
# 8   N  A  j  0.3885812
# 9   N  A  i -0.3443333
# 10  N  A  j -0.5478961
# 11  N  B  i  0.9806622
# 12  N  B  j -0.2366460
# 13  N  B  i  0.8097397
# 14  N  B  j  0.9806622
# 15  N  B  i -0.2366460
# 16  N  B  j  0.8097397

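The recipe for the subsetting is:

  1. Subset by c0 -> gives groups Y and N
  2. Within c0=="N", subset by c1 -> gives groups NA and NB
  3. Subset each of NA and NB by c2 -> gives groups NAi, NAj, NBi, NBj
  4. row_bind N?i with Y?i and N?j with Y?j (where ? is A or B) -> gives the final 4 data subsets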

In R:

subset.Yi <- d %>% filter(c0=="Y"& c2=="i") 
subset.Yj <- d %>% filter(c0=="Y"& c2=="j") 

list(
    d1 = d %>% filter(c0=="N" & c1 == "A", c2 == "i") %>% rbind(subset.Yi), 
    d2 = d %>% filter(c0=="N" & c1 == "B", c2 == "i") %>% rbind(subset.Yi), 
    d3 = d %>% filter(c0=="N" & c1 == "A", c2 == "j") %>% rbind(subset.Yj), 
    d4 = d %>% filter(c0=="N" & c1 == "B", c2 == "j") %>% rbind(subset.Yj) 
) %>% 
tibble::tibble(batches = paste0("batch",1:length(.)),data = .) ->tmp 

If matching on c2 were not important, I could do this:

d %>% filter(.,c0 == "N") %>% 
    group_by(.,c1) %>% 
    do(batches = rbind(d[d$c0=="Y", ],.)) -> tmp 

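But that is not the case here. Thanks in advance! BTW, I know this is doable outside the tidyverse, but the rest of my code follows the tidyverse approach and I would like to stay consistent.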

Answer

Here is a solution that works in this case (although it would be great to see other, possibly more general, approaches from others).

tmp <- d %>% 
    group_by(c2) %>% 
    nest(.key = c2) %>% 
    mutate(c2 = map(c2,~ .x %>% 
        filter(.,c0 == "N") %>% 
        group_by (.,c1) %>% 
        do(batches = bind_rows(
         .x %>% filter(.,c0 == "Y") %>% select(-c1), 
         (.) %>% select(-c1) )) 
       )) 

tmp here will contain the four subsets. I can then do something like:

tmp %>% unnest(c2) %>% .$batches %>% map(.,~sum(.$num)) %>% unlist 

This gives the column sum of num in each of the 4 subsets:

[1] -1.94302047 1.14452254 -0.08355576 1.62951506 
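
To keep each sum next to the batch it belongs to, rather than as a bare vector, one option is to compute it inside mutate(). The following is a minimal sketch on made-up data: toy_tmp, toy_summary and the values in them are hypothetical stand-ins for the unnested tmp, not output from the code above.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical stand-in for the unnested result: one row per batch,
# with the batch itself stored in the list-column `batches`
toy_tmp <- tibble(
  c2 = c("i", "j"),
  c1 = c("A", "A"),
  batches = list(
    tibble(c0 = c("Y", "N"), num = c(1, 2)),
    tibble(c0 = c("Y", "N", "N"), num = c(3, 4, 5))
  )
)

# Keep each summary beside the batch it was computed from
toy_summary <- toy_tmp %>%
  mutate(total  = map_dbl(batches, ~ sum(.x$num)),
         n_rows = map_int(batches, nrow))

toy_summary %>% select(c2, c1, total, n_rows)
```

Any function that takes a data frame and returns a single value can be swapped in for sum() here; for functions returning larger objects, map() would keep the results in another list-column.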

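Side note: deselecting c1 is technically not necessary here, but because I am row_binding parts of the data frame in a way that ignores the value of c1 (see the subsetting recipe above and note the ?), the value of c1 would be confusing, so I removed it.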
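
As a possibly more general take on the four-step recipe from the question, here is a sketch that splits the N rows by every c1/c2 combination and binds each piece to the Y rows with the same c2. It assumes dplyr 0.8 or later for group_split(); toy, n_groups and batches are illustrative names, and the numbers are made up so the sums are easy to check by hand.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy data with the same columns as d (values are made up for illustration)
toy <- tibble(
  c0  = c("Y", "Y", "N", "N", "N", "N"),
  c1  = c("A", "A", "A", "A", "B", "B"),
  c2  = c("i", "j", "i", "j", "i", "j"),
  num = c(1, 2, 10, 20, 100, 200)
)

# Steps 1-3: take the "N" rows and split them by every c1/c2 combination
n_groups <- toy %>%
  filter(c0 == "N") %>%
  group_split(c1, c2)

# Step 4: bind each N group to the "Y" rows that share its c2 value
batches <- map(n_groups, function(n_grp) {
  y_rows <- toy %>% filter(c0 == "Y", c2 %in% unique(n_grp$c2))
  bind_rows(n_grp, y_rows)
})

# Sum of num per batch, in the order NAi, NAj, NBi, NBj
batch_sums <- map_dbl(batches, ~ sum(.x$num))
batch_sums
# [1]  11  22 101 202
```

Because the grouping columns are passed to group_split() rather than hard-coded in filter() calls, adding more levels to c1 or c2 should not require changing the code.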