2017-01-22
0

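This follows on from my previous two questions, but with a slightly different problem. There is another wrinkle in the data I have been working with: missing data when using dplyr.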

date <- c("2016-03-24","2016-03-24","2016-03-24","2016-03-24","2016-03-24", 
      "2016-03-24","2016-03-24","2016-03-24","2016-03-24") 
location <- c(1,1,2,2,3,3,4,"out","out") 
sensor <- c(1,16,1,16,1,16,1,1,16) 
Temp <- c(35,34,92,42,21,47,42,63,12) 
df <- data.frame(date,location,sensor,Temp) 

Some of my data has missing values. They are not NA. The rows are simply absent from the data for that period.

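I want to subtract location "out" from location "4", ignoring the other locations, and I want to do this by date and sensor. I have managed this for data where both locations are present, using the following code: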

df %>% 
    filter(location %in% c(4, 'out')) %>% 
    group_by(date, sensor) %>% 
    summarize(Diff = Temp[location=="4"] - Temp[location=="out"], 
      location = first(location)) %>% 
    select(1, 2, 4, 3) 

However, for dates with missing data I get the following error: Error: expecting a single value. I think this is because dplyr does not know what to do when it reaches a missing data point.
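As a rough illustration of where that error comes from, here is a minimal base R sketch of a single group in which only the "out" reading exists (as for sensor 16 in the sample data). The names at_4, at_out and result are made up for the example. Subsetting on a location that is not present gives a zero-length vector, and subtracting from it gives another zero-length vector, so there is no single value for the summary to use.

```r
# One group where only the "out" reading exists (like sensor 16 in the sample data)
Temp <- c(12)
location <- c("out")

at_4 <- Temp[location == "4"]      # numeric(0): no row matches "4"
at_out <- Temp[location == "out"]  # 12

result <- at_4 - at_out            # still zero-length, not NA
length(result)                     # 0
```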

After some research, it looks as though do() is the way to go, but it returns a data frame without any values.

df %>% 
    filter(location %in% c(4, 'out')) %>% 
    group_by(date, sensor) %>% 
    do(Diff = Temp[location=="4"] - Temp[location=="out"], 
      location = first(location)) %>% 
    select(1, 2, 4, 3) 

Is there a way to override dplyr and tell it to return NA if it cannot find one of the entries to subtract?

+0

By the way, I get the same error with your data, even though there are no missing date values! – Rahul

+0

There are missing values. – hrbrmstr

Answers

1

If we want to return NA, one possible option is

library(dplyr) 
df %>% 
    filter(location %in% c(4, 'out')) %>% 
    group_by(date, sensor) %>% 
    arrange(sensor, location) %>% 
    summarise(Diff = if(n()==1) NA else diff(Temp), location = first(location)) %>% 
    select(1, 2, 4, 3) 
#  date sensor location Diff 
#  <fctr> <dbl> <fctr> <dbl> 
#1 2016-03-24  1  4 21 
#2 2016-03-24  16  out NA 
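
Note that diff(Temp) on the sorted rows gives "out" minus "4" (63 - 42 = 21), whereas the question asks for "4" minus "out". If the sign matters, one possible variation is sketched below. It is not part of the answer above: first_or_na is a hypothetical helper introduced only for this example, and the data frame is rebuilt with stringsAsFactors = FALSE so that location is plain character.

```r
library(dplyr)

# Hypothetical helper (not from dplyr): first element of x, or NA if x is empty
first_or_na <- function(x) if (length(x) == 0) NA_real_ else x[1]

df <- data.frame(
  date = rep("2016-03-24", 9),
  location = c(1, 1, 2, 2, 3, 3, 4, "out", "out"),
  sensor = c(1, 16, 1, 16, 1, 16, 1, 1, 16),
  Temp = c(35, 34, 92, 42, 21, 47, 42, 63, 12),
  stringsAsFactors = FALSE
)

res <- df %>%
  filter(location %in% c("4", "out")) %>%
  group_by(date, sensor) %>%
  # explicit "4" minus "out"; a missing side becomes NA instead of numeric(0)
  summarise(Diff = first_or_na(Temp[location == "4"]) -
                   first_or_na(Temp[location == "out"])) %>%
  ungroup()

res
```

With the sample data this should give -21 for sensor 1 and NA for sensor 16, without depending on row order.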

The equivalent option with data.table is

library(data.table) 
setDT(df)[location %in% c(4, 'out')][ 
    order(sensor, location), .(Diff = if(.N==1) NA_real_ else diff(Temp), 
     location = location[1]), .(date, sensor)][, c(1, 2, 4, 3), with = FALSE] 
#   date sensor location Diff 
#1: 2016-03-24  1  4 21 
#2: 2016-03-24  16  out NA 
3
library(tidyverse) 

date <- c("2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24", 
      "2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24") 
location <- c(1, 1, 2, 2, 3, 3, 4, "out", "out") 
sensor <- c(1, 16, 1, 16, 1, 16, 1, 1, 16) 
Temp <- c(35, 34, 92, 42, 21, 47, 42, 63, 12) 

df <- data_frame(date, location, sensor, Temp) 

# edge case helper 
`%||0%` <- function (x, y) { if (is.null(x) || length(x) == 0) y else x }

df %>% 
    filter(location %in% c(4, 'out')) %>% 
    mutate(location=factor(location, levels=c("4", "out"))) %>%    # make location a factor 
    arrange(sensor, location) %>%           # order it so we can use diff() 
    group_by(date, sensor) %>% 
    summarize(Diff = diff(Temp) %||0% NA, location = first(location)) %>% # deal with the edge case 
    select(1, 2, 4, 3) 
## Source: local data frame [2 x 4] 
## Groups: date [1] 
## 
##   date sensor location Diff 
##  <chr> <dbl> <fctr> <dbl> 
## 1 2016-03-24  1  4 21 
## 2 2016-03-24  16  out NA
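
The approaches above sort the rows and rely on diff(). A different technique, not used in either answer, is to reshape the two locations into columns so that the absent reading becomes an explicit NA. The sketch below assumes tidyr 1.0.0 or later for pivot_wider(); the name wide is made up for the example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  date = rep("2016-03-24", 9),
  location = c(1, 1, 2, 2, 3, 3, 4, "out", "out"),
  sensor = c(1, 16, 1, 16, 1, 16, 1, 1, 16),
  Temp = c(35, 34, 92, 42, 21, 47, 42, 63, 12),
  stringsAsFactors = FALSE
)

wide <- df %>%
  filter(location %in% c("4", "out")) %>%
  # one column per location; a (date, sensor) pair with no reading gets NA
  pivot_wider(names_from = location, values_from = Temp) %>%
  # "4" minus "out", as asked in the question
  mutate(Diff = `4` - out)

wide
```

Here the subtraction is written out explicitly, so the sign follows the question ("4" minus "out"), and sensor 16 ends up with NA because its "4" column is NA.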