2017-06-06

I have data from several sources of questionable reliability. Simplified data:

date  | value  | source 
=================================== 
2011-09-30 | 10.9910 | best 
2011-12-31 | 11.5000 | ok 
2011-12-31 | 11.5290 | best 
2012-03-31 | 12.8477 | ok 
2012-03-31 | 12.4677 | worst 
2012-06-30 | -1.5  | unacceptable 

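I want to clean this into a simple time series, with one value per date chosen by source priority: "best" beats "ok" beats "worst", and "unacceptable" is discarded. In my example: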

date  | value 
======================== 
2011-09-30 | 10.9910 
2011-12-31 | 11.5290 
2012-03-31 | 12.8477 
2012-06-30 | NA   # or just skip this line 

Any ideas on how to do this nicely? The dput of my sample data is:

df = structure(list(date = structure(c(15247, 15339, 15339, 15430, 15430, 15491, 15613, 15613, 15705, 15795, 15795, 15886, 15978, 15978, 15978, 16070, 16070, 16070, 16160, 16160), class = "Date"),  
    value = c(10.991, 11.500, 11.529, 12.8477, 12.4677, 11.542, 12.1203, 12.1146, 12.5053, 13.3556, 13.3628, 13.3372, 13.844, 13.844, 13.8419, 15.3403, 15.3403, 15.3306, 15.202, 15.202 ), 
    source = c("best", "ok", "best", "ok", "worst", "ok", "ok", "worst", "ok", "ok", "worst", "unacceptable", "ok", "best", "worst", "ok", "best", "worst", "ok", "best")), 
    row.names = c(NA, 20L), 
    .Names = c("date", "value", "source"), 
    class = "data.frame") 

Answers

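You can convert source to a factor whose levels are listed in priority order, then keep the row with the lowest level for each date.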

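library(dplyr)

df %>%
    # Levels in priority order; "unacceptable" is not a level, so it becomes NA
    mutate(source = factor(source, levels = c("best", "ok", "worst"))) %>%
    group_by(date) %>%
    # Keep the highest-priority (lowest) level per date; NA rows are dropped
    top_n(-1, source) %>%
    ungroup()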

# A tibble: 10 x 3
   date         value source
   <date>       <dbl> <fctr>
 1 2011-09-30 10.9910 best
 2 2011-12-31 11.5290 best
 3 2012-03-31 12.8477 ok
 4 2012-05-31 11.5420 ok
 5 2012-09-30 12.1203 ok
 6 2012-12-31 12.5053 ok
 7 2013-03-31 13.3556 ok
 8 2013-09-30 13.8440 best
 9 2013-12-31 15.3403 best
10 2014-03-31 15.2020 best
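
If you would rather not depend on dplyr, the same idea can be sketched in base R. This is only an illustration, not the method from the answer above: clean_series, priority, df_small and with_gaps are names made up for this example, and df_small is the simplified table from the question. It also shows one way to keep the date that has only an "unacceptable" value as NA, which the question mentions as an option.

```r
# Priority order, highest first. Sources not listed here are discarded.
priority <- c("best", "ok", "worst")

# Hypothetical helper: keep one row per date, from the best available source.
clean_series <- function(df, priority) {
  rank <- match(df$source, priority)        # 1 = best, NA = not acceptable
  ok <- !is.na(rank)
  keep <- df[ok, , drop = FALSE]
  keep <- keep[order(keep$date, rank[ok]), , drop = FALSE]
  keep <- keep[!duplicated(keep$date), , drop = FALSE]  # first row per date wins
  rownames(keep) <- NULL
  keep
}

# The simplified data from the question.
df_small <- data.frame(
  date = as.Date(c("2011-09-30", "2011-12-31", "2011-12-31",
                   "2012-03-31", "2012-03-31", "2012-06-30")),
  value = c(10.9910, 11.5000, 11.5290, 12.8477, 12.4677, -1.5),
  source = c("best", "ok", "best", "ok", "worst", "unacceptable"),
  stringsAsFactors = FALSE
)

result <- clean_series(df_small, priority)

# Optional: keep every date, with NA where no acceptable source exists.
with_gaps <- merge(data.frame(date = unique(df_small$date)),
                   result[, c("date", "value")],
                   by = "date", all.x = TRUE)
print(with_gaps)
```

Here result skips 2012-06-30 entirely, while with_gaps keeps it with an NA value. Changing the order of priority is enough to change which source wins.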