2015-11-24

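Identify and remove duplicates based on a condition in R

Hi, I'm stuck on a problem with duplicates in R. I've looked around and can't seem to find any help. I have a dataset like this: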

x = data.frame(id = c("A","A","A","A","A","A","A","B","B","B","B"), 
       StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006", 
           "08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"), 
       EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011", 
          "02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"), 
       Group = c(1,1,1,2,2,3,4,2,3,4,4), 
       TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011", 
           "NA", "07/09/2009", "07/09/2009", "08/10/2009"), 
       Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4) 
      ) 

> x
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
2   A 09/07/2006 06/08/2006     1 08/09/2006    4
3   A 09/07/2006 06/08/2006     1 08/10/2006 4858
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
9   B 08/06/2009 06/07/2009     3 07/09/2009  795
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4

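So basically what I want to do is identify duplicates in the TestDate variable by ID. For example, the dates 08/09/2006 and 08/10/2006 are repeated within the same person but for different groups, and I don't want the same test date to appear more than once for a given ID. The criterion for choosing which TestDate to keep is to take the difference in days between the TestDate and the StartDate and EndDate of each group, then keep the row with the smallest difference in days. For example, for the date 08/10/2006 I would like to keep row 5, because its TestDate is closer to its StartDate than the corresponding difference for row 3. In the end I would like to get a dataset similar to this: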

> xfinal
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4

Any help would be greatly appreciated. Thanks.

'x [! Some more options [here](http://stackoverflow.com/questions/13967063/remove-duplicate-rows-in-r) – rawr

Answer

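# Parse the dd/mm/yyyy strings as Date objects (the string "NA" becomes NA)
x$StartDate <- as.Date(x$StartDate, format = "%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate, format = "%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate, format = "%d/%m/%Y")
# Length of each StartDate-to-EndDate window in days
# (units must be named; the third positional argument of difftime is tz)
x$Diff <- difftime(x$EndDate, x$StartDate, units = "days")

# Sort by id, then by window length
x <- x[order(x$id, x$Diff), ]

# Keep only the first row for each id/TestDate pair
x <- x[!duplicated(x[, c("id", "TestDate")]), ]
x$Diff <- NULL
x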
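
Note that the code above ranks rows by the length of the StartDate-to-EndDate window, not by how close TestDate is to that window. Below is a minimal sketch of the criterion as described in the question. It assumes "closest" means the smaller of the absolute day differences between TestDate and either StartDate or EndDate, and that rows with a missing TestDate should all be kept. The helper column `Dist` and the intermediate names `fmt`, `ord` and `keep` are made up for illustration.

```r
# Rebuild the question's data, with dates still as dd/mm/yyyy strings
x <- data.frame(
  id = c(rep("A", 7), rep("B", 4)),
  StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
                "08/10/2006", "09/04/2007", "02/03/2011", "05/05/2005",
                "08/06/2009", "07/09/2009", "07/09/2009"),
  EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006",
              "19/11/2006", "07/05/2007", "30/03/2011", "02/06/2005",
              "06/07/2009", "05/10/2009", "05/10/2009"),
  Group = c(1, 1, 1, 2, 2, 3, 4, 2, 3, 4, 4),
  TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006",
               "08/10/2006", "NA", "02/03/2011", "NA", "07/09/2009",
               "07/09/2009", "08/10/2009"),
  Code = c(4, 4, 4858, 4, 4858, NA, 4, NA, 795, 795, 4)
)

# Parse the strings as Date objects (the string "NA" becomes a missing date)
fmt <- "%d/%m/%Y"
x$StartDate <- as.Date(x$StartDate, format = fmt)
x$EndDate   <- as.Date(x$EndDate, format = fmt)
x$TestDate  <- as.Date(x$TestDate, format = fmt)

# Assumed criterion: smallest distance in days from TestDate
# to either end of the row's StartDate/EndDate window
x$Dist <- pmin(abs(as.numeric(x$TestDate - x$StartDate)),
               abs(as.numeric(x$TestDate - x$EndDate)))

# Sort so the closest row comes first within each id/TestDate pair
ord <- x[order(x$id, x$TestDate, x$Dist), ]

# Drop later repeats of an id/TestDate pair, but keep every row
# whose TestDate is missing
keep <- is.na(ord$TestDate) | !duplicated(ord[, c("id", "TestDate")])
xfinal <- ord[keep, ]

# Restore the original row order and drop the helper column
xfinal <- xfinal[order(as.integer(rownames(xfinal))), ]
xfinal$Dist <- NULL
xfinal
```

On the sample data this keeps rows 1, 4, 5, 6, 7, 8, 10 and 11, which matches the xfinal shown in the question. If the intended distance is something else (for example, distance to StartDate only), only the `Dist` line would need to change.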