2015-04-23 26 views
2

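R: Merging tables and filling empty cells with factor information

I have a relatively complex table merging/expansion problem. Below I list an example DATA table and the desired RESULT table. I have 4 factors (SITE, DATE, SAMPLE, TAXA) and three numeric columns (1, 2, 3). I need every SITE, DATE, SAMPLE combination to have TAXA 1, 2, 100 and 150. In the process, I need to fill the empty factor cells with the appropriate information and fill the numeric columns with 0.

I apologize for the large "example" datasets, but they capture the complexity of my data. My full dataset is somewhat larger and includes 4 SITE, 15 DATE, 12 SAMPLE and 167 TAXA. A solution using dplyr is preferred, but I will certainly accept other options. Doing this in Excel would take forever! Thanks in advance.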

DATA 
    SITE DATE SAMPLE TAXA 1 2 3 
    NSV 8-Jul-13 Pool 1 10 10 10 
    NSV 8-Jul-13 Pool 2 10 10 10 
    NSV 8-Jul-13 Riffle 1 10 10 10 
    NSV 8-Jul-13 Riffle 2 10 10 10 
    NSV 23-Oct-13 Pool 1 10 10 10 
    NSV 23-Oct-13 Pool 2 10 10 10 
    NSV 23-Oct-13 Riffle 1 10 10 10 
    NSV 23-Oct-13 Riffle 2 10 10 10 
    SFP 4-Jul-13 Pool 1 10 10 10 
    SFP 4-Jul-13 Pool 2 10 10 10 
    SFP 4-Jul-13 Riffle 1 10 10 10 
    SFP 4-Jul-13 Riffle 2 10 10 10 
    SFP 27-Oct-13 Pool 1 10 10 10 
    SFP 27-Oct-13 Pool 2 10 10 10 
    SFP 27-Oct-13 Pool 100 10 10 10 
    SFP 27-Oct-13 Pool 150 10 10 10 
    SFP 27-Oct-13 Riffle 1 10 10 10 
    SFP 27-Oct-13 Riffle 2 10 10 10 
    SFP 27-Oct-13 Riffle 100 10 10 10 
    SFP 27-Oct-13 Riffle 150 10 10 10 

RESULT 
    SITE DATE SAMPLE TAXA 1 2 3 
    NSV 8-Jul-13 Pool 1 10 10 10 
    NSV 8-Jul-13 Pool 2 10 10 10 
    NSV 8-Jul-13 Pool 100 0 0 0 
    NSV 8-Jul-13 Pool 150 0 0 0 
    NSV 8-Jul-13 Riffle 1 10 10 10 
    NSV 8-Jul-13 Riffle 2 10 10 10 
    NSV 8-Jul-13 Riffle 100 0 0 0 
    NSV 8-Jul-13 Riffle 150 0 0 0 
    NSV 23-Oct-13 Pool 1 10 10 10 
    NSV 23-Oct-13 Pool 2 10 10 10 
    NSV 23-Oct-13 Pool 100 0 0 0 
    NSV 23-Oct-13 Pool 150 0 0 0 
    NSV 23-Oct-13 Riffle 1 10 10 10 
    NSV 23-Oct-13 Riffle 2 10 10 10 
    NSV 23-Oct-13 Riffle 100 0 0 0 
    NSV 23-Oct-13 Riffle 150 0 0 0 
    SFP 4-Jul-13 Pool 1 10 10 10 
    SFP 4-Jul-13 Pool 2 10 10 10 
    SFP 4-Jul-13 Pool 100 0 0 0 
    SFP 4-Jul-13 Pool 150 0 0 0 
    SFP 4-Jul-13 Riffle 1 10 10 10 
    SFP 4-Jul-13 Riffle 2 10 10 10 
    SFP 4-Jul-13 Riffle 100 0 0 0 
    SFP 4-Jul-13 Riffle 150 0 0 0 
    SFP 27-Oct-13 Pool 1 10 10 10 
    SFP 27-Oct-13 Pool 2 10 10 10 
    SFP 27-Oct-13 Pool 100 10 10 10 
    SFP 27-Oct-13 Pool 150 10 10 10 
    SFP 27-Oct-13 Riffle 1 10 10 10 
    SFP 27-Oct-13 Riffle 2 10 10 10 
    SFP 27-Oct-13 Riffle 100 10 10 10 
    SFP 27-Oct-13 Riffle 150 10 10 10 

Answers

2

Here is a non-dplyr solution. I'm sure there are more elegant ways, but this is a base R approach. I'll call your input data.frame d:

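# Cross every observed SITE/DATE pair (pasted into one string) with every SAMPLE and TAXA
d2 <- expand.grid(apply(unique(d[,c("SITE","DATE")]), 1, paste, collapse=" "), 
        unique(d$SAMPLE), unique(d$TAXA)) 
# Split the pasted key back into separate SITE and DATE columns
d2 <- cbind(matrix(unlist(strsplit(as.character(d2$Var1), " ")), ncol=2, byrow=TRUE), 
      d2[,2:3]) 
names(d2)<-names(d)[1:4] 

# Left join onto the original data; combinations missing from d get NA
d2 <- merge(d2,d, all.x=TRUE) 

# Replace the NAs (which occur only in the numeric columns) with 0
d2[which(is.na(d2), arr.ind=TRUE)] <- 0 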

Output:

SITE  DATE SAMPLE TAXA X1 X2 X3 
1 NSV 23-Oct-13 Pool 1 10 10 10 
2 NSV 23-Oct-13 Pool 2 10 10 10 
3 NSV 23-Oct-13 Pool 100 0 0 0 
4 NSV 23-Oct-13 Pool 150 0 0 0 
5 NSV 23-Oct-13 Riffle 1 10 10 10 
6 NSV 23-Oct-13 Riffle 2 10 10 10 
7 NSV 23-Oct-13 Riffle 100 0 0 0 
8 NSV 23-Oct-13 Riffle 150 0 0 0 
9 NSV 8-Jul-13 Pool 1 10 10 10 
10 NSV 8-Jul-13 Pool 2 10 10 10 
11 NSV 8-Jul-13 Pool 100 0 0 0 
12 NSV 8-Jul-13 Pool 150 0 0 0 
13 NSV 8-Jul-13 Riffle 1 10 10 10 
14 NSV 8-Jul-13 Riffle 2 10 10 10 
15 NSV 8-Jul-13 Riffle 100 0 0 0 
16 NSV 8-Jul-13 Riffle 150 0 0 0 
17 SFP 27-Oct-13 Pool 1 10 10 10 
18 SFP 27-Oct-13 Pool 2 10 10 10 
19 SFP 27-Oct-13 Pool 100 10 10 10 
20 SFP 27-Oct-13 Pool 150 10 10 10 
21 SFP 27-Oct-13 Riffle 1 10 10 10 
22 SFP 27-Oct-13 Riffle 2 10 10 10 
23 SFP 27-Oct-13 Riffle 100 10 10 10 
24 SFP 27-Oct-13 Riffle 150 10 10 10 
25 SFP 4-Jul-13 Pool 1 10 10 10 
26 SFP 4-Jul-13 Pool 2 10 10 10 
27 SFP 4-Jul-13 Pool 100 0 0 0 
28 SFP 4-Jul-13 Pool 150 0 0 0 
29 SFP 4-Jul-13 Riffle 1 10 10 10 
30 SFP 4-Jul-13 Riffle 2 10 10 10 
31 SFP 4-Jul-13 Riffle 100 0 0 0 
32 SFP 4-Jul-13 Riffle 150 0 0 0 
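
The paste/strsplit round trip above relies on SITE and DATE values containing no spaces. A minimal base R sketch that avoids it is given here, under a few assumptions: the example data is rebuilt compactly so the block runs on its own, the numeric columns are named v1, v2, v3, and keys, grid and out are illustrative names. Unlike the code above, it crosses the observed SITE/DATE/SAMPLE combinations (rather than SITE/DATE pairs times all SAMPLE values) with the unique TAXA.

```r
# Rebuild the question's 20-row example; v1..v3 stand in for columns 1, 2, 3.
d <- rbind(
  expand.grid(TAXA = c(1L, 2L), SAMPLE = c("Pool", "Riffle"),
              DATE = c("8-Jul-13", "23-Oct-13"), SITE = "NSV",
              stringsAsFactors = FALSE),
  expand.grid(TAXA = c(1L, 2L), SAMPLE = c("Pool", "Riffle"),
              DATE = "4-Jul-13", SITE = "SFP", stringsAsFactors = FALSE),
  expand.grid(TAXA = c(1L, 2L, 100L, 150L), SAMPLE = c("Pool", "Riffle"),
              DATE = "27-Oct-13", SITE = "SFP", stringsAsFactors = FALSE)
)
d <- cbind(d[, c("SITE", "DATE", "SAMPLE", "TAXA")], v1 = 10L, v2 = 10L, v3 = 10L)

# Observed SITE/DATE/SAMPLE combinations
keys <- unique(d[, c("SITE", "DATE", "SAMPLE")])

# by = NULL makes merge() return the Cartesian product: every key with every TAXA
grid <- merge(keys, data.frame(TAXA = unique(d$TAXA)), by = NULL)

# Left join the full grid onto the data, then zero-fill the gaps
out <- merge(grid, d, all.x = TRUE)
out[is.na(out)] <- 0L

nrow(out)
```

Because the key columns never pass through a pasted string, values with embedded spaces should survive unchanged.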
2

Starting with your data:

dat <- structure(list(SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
          .Label = c("NSV", "SFP"), class = "factor"), 
         DATE = structure(c(4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
          .Label = c("23-Oct-13", "27-Oct-13", "4-Jul-13", "8-Jul-13" 
            ), class = "factor"), 
         SAMPLE = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Pool", "Riffle"), class = "factor"), 
         TAXA = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 100L, 150L, 1L, 2L, 100L, 150L), 
         v1 = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), 
         v2 = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), 
         v3 = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L)), 
       .Names = c("SITE", "DATE", "SAMPLE", "TAXA", "v1", "v2", "v3"), 
       class = "data.frame", row.names = c(NA, -20L)) 

One technique, using dplyr:

library(dplyr) 
eg <- do.call('expand.grid', lapply(dat[,1:4], unique)) 
result <- right_join(dat, eg, by=c('SITE', 'DATE', 'SAMPLE', 'TAXA')) %>% 
    mutate(v1 = ifelse(is.na(v1), 0, v1), 
      v2 = ifelse(is.na(v2), 0, v2), 
      v3 = ifelse(is.na(v3), 0, v3)) %>% 
    arrange(SITE, DATE, SAMPLE, TAXA) 
head(result, n=8) 
## SITE  DATE SAMPLE TAXA v1 v2 v3 
## 1 NSV 23-Oct-13 Pool 1 10 10 10 
## 2 NSV 23-Oct-13 Pool 2 10 10 10 
## 3 NSV 23-Oct-13 Pool 100 0 0 0 
## 4 NSV 23-Oct-13 Pool 150 0 0 0 
## 5 NSV 23-Oct-13 Riffle 1 10 10 10 
## 6 NSV 23-Oct-13 Riffle 2 10 10 10 
## 7 NSV 23-Oct-13 Riffle 100 0 0 0 
## 8 NSV 23-Oct-13 Riffle 150 0 0 0 

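The arrange call is only there to order the rows like your result; the data is complete either way.

Edit

I realized I was generating too many rows in the data.frame. This version is more correct, based on @Frank's comment, and more compact (arrange is still optional):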
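
On the ordering: because DATE is a factor with alphabetically sorted levels, arrange puts 23-Oct-13 before 8-Jul-13, whereas the desired RESULT lists the dates in their order of first appearance. A small sketch of one way to match that order is to redefine the factor levels before arranging. The four-row data frame and the names by_levels and by_appearance are made up for illustration.

```r
library(dplyr)

# Tiny stand-in for the SITE/DATE columns of dat, stored as factors
dat_small <- data.frame(
  SITE = c("NSV", "NSV", "SFP", "SFP"),
  DATE = c("8-Jul-13", "23-Oct-13", "4-Jul-13", "27-Oct-13"),
  stringsAsFactors = TRUE
)

# Default levels are sorted as text, so arrange() follows that order
by_levels <- arrange(dat_small, SITE, DATE)

# Re-level DATE by first appearance, then arrange again
dat_small$DATE <- factor(dat_small$DATE,
                         levels = unique(as.character(dat_small$DATE)))
by_appearance <- arrange(dat_small, SITE, DATE)

by_appearance
```

Converting DATE with as.Date(DATE, format = "%d-%b-%y") would give a true chronological sort, but %b depends on the session locale, so the factor approach is the safer assumption here.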

dat %>% select(SITE, DATE, SAMPLE) %>% unique() %>% 
    merge(y=list(TAXA=unique(dat$TAXA)), all.x=TRUE) %>% 
    arrange(SITE, DATE, SAMPLE, TAXA) 
## SITE  DATE SAMPLE TAXA 
## 1 NSV 23-Oct-13 Pool 1 
## 2 NSV 23-Oct-13 Pool 2 
## 3 NSV 23-Oct-13 Pool 100 
## 4 NSV 23-Oct-13 Pool 150 
## 5 NSV 23-Oct-13 Riffle 1 
## 6 NSV 23-Oct-13 Riffle 2 
## 7 NSV 23-Oct-13 Riffle 100 
## 8 NSV 23-Oct-13 Riffle 150 
## ...snip... 
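
The pipeline above only builds the grid of keys; it still has to be joined back to dat and zero-filled. As one possible sketch of the whole task in a single step, the tidyr package (an assumption: neither answer uses it) provides complete() with nesting(), which expands TAXA only within the SITE/DATE/SAMPLE combinations that actually occur. The data is rebuilt compactly so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Rebuild the 20-row example with the same column names as dat
dat <- rbind(
  expand.grid(TAXA = c(1L, 2L), SAMPLE = c("Pool", "Riffle"),
              DATE = c("8-Jul-13", "23-Oct-13"), SITE = "NSV",
              stringsAsFactors = FALSE),
  expand.grid(TAXA = c(1L, 2L), SAMPLE = c("Pool", "Riffle"),
              DATE = "4-Jul-13", SITE = "SFP", stringsAsFactors = FALSE),
  expand.grid(TAXA = c(1L, 2L, 100L, 150L), SAMPLE = c("Pool", "Riffle"),
              DATE = "27-Oct-13", SITE = "SFP", stringsAsFactors = FALSE)
)
dat <- cbind(dat[, c("SITE", "DATE", "SAMPLE", "TAXA")],
             v1 = 10L, v2 = 10L, v3 = 10L)

result <- dat %>%
  # nesting() keeps only observed SITE/DATE/SAMPLE combinations;
  # TAXA is expanded to all of its unique values within each of them
  complete(nesting(SITE, DATE, SAMPLE), TAXA,
           fill = list(v1 = 0L, v2 = 0L, v3 = 0L)) %>%
  arrange(SITE, DATE, SAMPLE, TAXA)

nrow(result)
```

With more factor columns, as in the full dataset, they would go inside nesting() alongside SITE, DATE and SAMPLE, provided they should not be expanded themselves.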

Good catch, thx. – r2evans

Fixed per your comment, thanks! – r2evans

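Thanks @Frank and @r2evans! I used a hybrid approach: 'dplyr' and the 'base package'. My actual dataset is slightly more complex than the example I provided (e.g., more factor columns), so I used what I learned from your code to put things together. Take care. – Vesuccio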