2016-05-25 53 views
2

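distinct (dplyr) not working as expected - unique observations based on criteria

I have some data with repeated records, and some of them should not exist (mark and recov should appear only once per band, while recap can appear several times). I would like to select unique observations (band) based on a certain value in a column (variable=="mark") and keep the rest of the data ("recap" and "recov").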

I use dplyr to group my data by band and then select unique records where the column variable=="mark". This is my code:

uniq <- df %>%group_by(band) %>% distinct(variable=="mark") 

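I found that it does not work well: some observations with variable=="recap" have been removed (for example, in band=113749924 the recap value from 1993 is lost, and likewise in band=113728509 one recap value is missing).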
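A small sketch of why the recap row disappears, using hypothetical toy data shaped like band 113749924. The assumption here is a dplyr version that supports `.keep_all` (it is added only so the remaining columns stay visible); the toy data frame and the name `res` are illustrative and not from the original post. `distinct()` turns the expression into a logical column and keeps the first row per combination of band and TRUE/FALSE, so every non-mark row after the first one in a band is dropped.

```r
library(dplyr)

# Hypothetical toy data shaped like band 113749924 in the question
toy <- data.frame(
  band     = c(1L, 1L, 1L),
  variable = c("mark", "recov", "recap"),
  year     = c(1991L, 1994L, 1993L)
)

# distinct() evaluates variable == "mark" into a logical column and keeps
# the first row for each distinct (band, TRUE/FALSE) combination.
# .keep_all = TRUE is an addition so the other columns remain in the result.
res <- toy %>%
  group_by(band) %>%
  distinct(variable == "mark", .keep_all = TRUE) %>%
  ungroup()

# Only the mark row and the first non-mark row (recov) survive
res$variable
```

In this sketch the recap row is lost because it shares the value FALSE with the recov row that comes before it, which matches the behaviour described in the question.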

Here is a data example:

df <- structure(list(band = c(113728501L, 113728502L, 113728503L, 113728504L, 
113728505L, 113728505L, 113728506L, 113728506L, 113728507L, 113728508L, 
113728509L, 113728509L, 113728509L, 113728509L, 113728510L, 113728510L, 
113729709L, 113729709L, 113729709L, 113729710L, 113729711L, 113729712L, 
113729713L, 113729714L, 113729715L, 113729716L, 113729717L, 113729718L, 
113729719L, 113729720L, 113729720L, 113729721L, 113729722L, 113729723L, 
113729724L, 113729725L, 113729726L, 113729727L, 113729728L, 113729729L, 
113729730L, 113729731L, 113729732L, 113729733L, 113729733L, 113729733L, 
113729734L, 113729735L, 113729735L, 113729735L, 113729914L, 113729914L, 
113729914L, 113729914L, 113729915L, 113729916L, 113729917L, 113729918L, 
113729919L, 113729920L, 113729921L, 113729922L, 113729923L, 113729924L, 
113729925L, 113729926L, 113729927L, 113729928L, 113729929L, 113749923L, 
113749924L, 113749924L, 113749924L), variable = structure(c(1L, 
1L, 1L, 1L, 1L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 3L, 
2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L), .Label = c("mark", "recap", 
"recov"), class = "factor"), year = c(1994L, 1994L, 1994L, 1994L, 
1994L, 2012L, 1994L, 1999L, 1994L, 1994L, 1994L, 1994L, 2002L, 
2003L, 1994L, 1996L, 1994L, 2002L, 1998L, 1994L, 1994L, 1994L, 
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1995L, 
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 
1994L, 1994L, 1994L, 1994L, 2002L, 2001L, 1994L, 1994L, 1999L, 
1998L, 1994L, 1994L, 1999L, 2005L, 1994L, 1994L, 1994L, 1994L, 
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 
1994L, 1994L, 1991L, 1991L, 1994L, 1993L)), .Names = c("band", 
"variable", "year"), class = "data.frame", row.names = c(NA, 
-73L)) 

In the end I would like to have something like this (e.g. for 113749924):

band  year variable 
113749924 1991 mark 
113749924 1993 recap 
113749924 1994 recov 

Could you please help me find what is wrong, or suggest alternative code?

Thank you very much!

+0

Posting your data inline as the output of 'dput' is the best way to get help. External links are not useful. – Gopala

+0

Thank you very much for the advice! I learned something new today – MSS

+1

You can try 'distinct(df)'. Or, if you use 'group_by', you can use 'slice' to get the first row of each set of duplicates. – Gopala

Answers

1

One option is to group_by 'band', filter the rows where 'variable' is "mark", get the distinct rows, and then bind them (bind_rows) with the filtered dataset where 'variable' is not "mark".

df %>% 
    group_by(band) %>% 
    filter(variable == "mark") %>%               # keep only the mark rows
    ungroup() %>% 
    distinct() %>%                               # drop exact duplicate mark rows
    bind_rows(filter(df, variable != "mark")) %>% # add back all recap/recov rows
    arrange(band) %>% 
    as.data.frame()
     band variable year 
1 113728501  mark 1994 
2 113728502  mark 1994 
3 113728503  mark 1994 
4 113728504  mark 1994 
5 113728505  mark 1994 
6 113728505 recov 2012 
7 113728506  mark 1994 
8 113728506 recap 1999 
9 113728507  mark 1994 
10 113728508  mark 1994 
11 113728509  mark 1994 ###only one mark. 
12 113728509 recap 2002 
13 113728509 recap 2003 
14 113728510  mark 1994 
15 113728510 recap 1996 
16 113729709  mark 1994 
17 113729709 recov 2002 
18 113729709 recap 1998 
19 113729710  mark 1994 
20 113729711  mark 1994 
21 113729712  mark 1994 
22 113729713  mark 1994 
23 113729714  mark 1994 
24 113729715  mark 1994 
25 113729716  mark 1994 
26 113729717  mark 1994 
27 113729718  mark 1994 
28 113729719  mark 1994 
29 113729720  mark 1994 
30 113729720 recov 1995 
31 113729721  mark 1994 
32 113729722  mark 1994 
33 113729723  mark 1994 
34 113729724  mark 1994 
35 113729725  mark 1994 
36 113729726  mark 1994 
37 113729727  mark 1994 
38 113729728  mark 1994 
39 113729729  mark 1994 
40 113729730  mark 1994 
41 113729731  mark 1994 
42 113729732  mark 1994 
43 113729733  mark 1994 
44 113729733 recov 2002 
45 113729733 recap 2001 
46 113729734  mark 1994 
47 113729735  mark 1994 
48 113729735 recov 1999 
49 113729735 recap 1998 
50 113729914  mark 1994 
51 113729914 recap 1999 
52 113729914 recap 2005 
53 113729915  mark 1994 
54 113729916  mark 1994 
55 113729917  mark 1994 
56 113729918  mark 1994 
57 113729919  mark 1994 
58 113729920  mark 1994 
59 113729921  mark 1994 
60 113729922  mark 1994 
61 113729923  mark 1994 
62 113729924  mark 1994 
63 113729925  mark 1994 
64 113729926  mark 1994 
65 113729927  mark 1994 
66 113729928  mark 1994 
67 113729929  mark 1994 
68 113749923  mark 1991 
69 113749924  mark 1991 
70 113749924 recov 1994 
71 113749924 recap 1993 

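Another option is to group_by both 'band' and 'variable', create a logical condition that is TRUE where row_number() is greater than 1 and 'variable' is "mark", negate it (!), and filter the rows.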

df %>% 
    group_by(band, variable) %>% 
    filter(!(row_number() >1 & variable =="mark")) 
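As a quick check of this second option, here is a minimal sketch on hypothetical toy data shaped like band 113728509 (two identical mark rows plus two recap rows). The data frame `toy` and the name `kept` are illustrative only.

```r
library(dplyr)

# Hypothetical toy data: a duplicated mark row and two recap rows
toy <- data.frame(
  band     = c(1L, 1L, 1L, 1L),
  variable = c("mark", "mark", "recap", "recap"),
  year     = c(1994L, 1994L, 2002L, 2003L)
)

# Within each (band, variable) group, drop every mark row after the first;
# recap and recov rows never satisfy the condition, so they are all kept.
kept <- toy %>%
  group_by(band, variable) %>%
  filter(!(row_number() > 1 & variable == "mark")) %>%
  ungroup()

kept
```

One difference from the first option is worth keeping in mind: this keeps the first mark row per band even when the mark rows differ in another column such as year, whereas distinct() only removes rows that are identical in every column.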
+0

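Sorry, I think I was not clear enough (I have changed it in my post): I also want to keep the other data for "recov" and "recap", as well as the "mark" (which is sometimes duplicated) – MSS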

+2

You can drop the 'group_by' and 'ungroup' – alistaire
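A possible reading of that last suggestion, sketched on hypothetical toy data (the names `toy` and `simplified` are illustrative). Since distinct() without arguments compares all columns, including band, the grouping step should not change which mark rows are treated as duplicates.

```r
library(dplyr)

# Hypothetical toy data: band 1 has a duplicated mark row and one recap row
toy <- data.frame(
  band     = c(1L, 1L, 1L, 2L),
  variable = c("mark", "mark", "recap", "mark"),
  year     = c(1994L, 1994L, 2002L, 1994L)
)

# Same pipeline as the first option, without group_by() and ungroup()
simplified <- toy %>%
  filter(variable == "mark") %>%
  distinct() %>%
  bind_rows(filter(toy, variable != "mark")) %>%
  arrange(band)

simplified
```

In this sketch each band ends up with a single mark row, and the recap row is carried over unchanged.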