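Distinct (dplyr) not quite working - unique observations based on criteria
I have some data with duplicated records, and some of them should not exist (mark and recov should appear only once per band, while recap can appear several times). I want to select unique observations per band for the rows where variable=="mark", and keep the remaining "recap" and "recov" rows.
I used dplyr to group my data by band and then select the unique records where variable=="mark". This is my code: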
uniq <- df %>% group_by(band) %>% distinct(variable == "mark")
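I found that it does not work well: looking at some observations, rows with variable=="recap" have been deleted (for example, in band=113749924 the recap value from 1993 is missing, and likewise in band=113728509 one recap value is missing).
Here is a sample of the data: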
structure(list(band = c(113728501L, 113728502L, 113728503L, 113728504L,
113728505L, 113728505L, 113728506L, 113728506L, 113728507L, 113728508L,
113728509L, 113728509L, 113728509L, 113728509L, 113728510L, 113728510L,
113729709L, 113729709L, 113729709L, 113729710L, 113729711L, 113729712L,
113729713L, 113729714L, 113729715L, 113729716L, 113729717L, 113729718L,
113729719L, 113729720L, 113729720L, 113729721L, 113729722L, 113729723L,
113729724L, 113729725L, 113729726L, 113729727L, 113729728L, 113729729L,
113729730L, 113729731L, 113729732L, 113729733L, 113729733L, 113729733L,
113729734L, 113729735L, 113729735L, 113729735L, 113729914L, 113729914L,
113729914L, 113729914L, 113729915L, 113729916L, 113729917L, 113729918L,
113729919L, 113729920L, 113729921L, 113729922L, 113729923L, 113729924L,
113729925L, 113729926L, 113729927L, 113729928L, 113729929L, 113749923L,
113749924L, 113749924L, 113749924L), variable = structure(c(1L,
1L, 1L, 1L, 1L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L,
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 3L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L), .Label = c("mark", "recap",
"recov"), class = "factor"), year = c(1994L, 1994L, 1994L, 1994L,
1994L, 2012L, 1994L, 1999L, 1994L, 1994L, 1994L, 1994L, 2002L,
2003L, 1994L, 1996L, 1994L, 2002L, 1998L, 1994L, 1994L, 1994L,
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1995L,
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L,
1994L, 1994L, 1994L, 1994L, 2002L, 2001L, 1994L, 1994L, 1999L,
1998L, 1994L, 1994L, 1999L, 2005L, 1994L, 1994L, 1994L, 1994L,
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L,
1994L, 1994L, 1991L, 1991L, 1994L, 1993L)), .Names = c("band",
"variable", "year"), class = "data.frame", row.names = c(NA,
-73L))
In the end I would like to have something like this (for example, for band 113749924):
band year variable
113749924 1991 mark
113749924 1993 recap
113749924 1994 recov
Could you please help me find what is wrong, or suggest alternative code?
Thank you very much!
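Posting the data inline as the output of 'dput' is the best way to get help. External links are not useful. – Gopala
Thank you very much for the advice! I learned something new today – MSS
You could try 'distinct(df)'. Or, if you use 'group_by', you can use 'slice' to get the first row of each set of duplicates. – Gopala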
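To make the suggestion in the last comment concrete, here is a minimal sketch in R. It assumes the dplyr package is installed and uses only seven rows taken from the question's data (bands 113728509 and 113749924). As far as I can tell, `distinct(variable == "mark")` treats the expression as a new logical column and keeps one row per distinct TRUE/FALSE value within each band, which would explain why recap rows go missing. The second option uses `filter()` with `row_number()` rather than `slice()`, because `slice(1)` per band and variable would also drop the repeated recap rows; that substitution is my assumption about what is wanted, not something stated in the thread.

```r
library(dplyr)

# Subset of the question's data: one band with a duplicated "mark" row,
# and the band used in the expected-output example.
df <- data.frame(
  band = c(113728509L, 113728509L, 113728509L, 113728509L,
           113749924L, 113749924L, 113749924L),
  variable = factor(c("mark", "mark", "recap", "recap",
                      "mark", "recov", "recap"),
                    levels = c("mark", "recap", "recov")),
  year = c(1994L, 1994L, 2002L, 2003L, 1991L, 1994L, 1993L)
)

# Option 1: drop rows that are identical in every column.
uniq1 <- distinct(df)

# Option 2: keep only the first "mark" and first "recov" per band,
# but keep every "recap" row (assumption: recap may legitimately repeat).
uniq2 <- df %>%
  group_by(band, variable) %>%
  filter(variable == "recap" | row_number() == 1) %>%
  ungroup() %>%
  arrange(band, year)

print(uniq2)
```

Option 1 is enough when the unwanted rows are exact duplicates, as they are in this subset. Option 2 is the safer choice if duplicated mark or recov rows can differ in another column such as year, since it keeps the first one per band regardless.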