2015-11-06 124 views
1

How can I calculate the mean of a variable between two dates? A reproducible example is below.

year <- c(1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996, 
     1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996, 
     1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997, 
     1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997) 
month <- c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC") 
station <- c("A","A","A","A","A","A","A","A","A","A","A","A", 
     "B","B","B","B","B","B","B","B","B","B","B","B") 

concentration <- as.numeric(round(runif(48,20,40),1)) 

df <- data.frame(year,month,station,concentration) 


id <- c(1,2,3,4) 
station1996 <- c("A","A","B","B") 
station1997 <- c("B","A","A","B") 
start <- c("06/01/1996","07/01/1996","07/01/1996","08/01/1996") 
end <- c("04/01/1997","04/01/1997","04/01/1997","05/01/1997") 

participant <- data.frame(id,station1996,station1997,start,end) 
participant$start <- as.Date(participant$start, format = "%m/%d/%Y") 
participant$end <- as.Date(participant$end, format = "%m/%d/%Y") 

So I have two datasets, as follows:

df
   year month station concentration
1  1996   JAN       A          24.4
2  1996   FEB       A          37.0
3  1996   MAR       A          39.5
4  1996   APR       A          28.0
...
45 1997   SEP       B          37.7
46 1997   OCT       B          35.2
47 1997   NOV       B          26.8
48 1997   DEC       B          40.0

participant
  id station1996 station1997      start        end
1  1           A           B 1996-06-01 1997-04-01
2  2           A           A 1996-07-01 1997-04-01
3  3           B           A 1996-07-01 1997-04-01
4  4           B           B 1996-08-01 1997-05-01

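For each id, I want to calculate the mean concentration between the start and end dates (by month and year). Note that a participant's station may change from one year to the next.

For example, for id = 1, I want to calculate the mean concentration from June 1996 to April 1997. This should be based on the concentrations at station A from June 1996 to December 1996 and at station B from January 1997 to April 1997.

Can anyone help?

Thanks a lot.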

+1

Step 1: convert 'start' and 'end' to 'Date' or 'POSIXct' format, and combine 'year' and 'month' into a new column in the same format. – MichaelChirico
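The first step suggested in this comment can be sketched in base R as follows. This is only an illustration: `df_demo` is a hypothetical three-row stand-in for the question's `df` (values copied from the printed output), and pinning every month to its first day is an assumption made here so that the new column is comparable with `start` and `end`.

```r
# Hypothetical stand-in for a few rows of the question's `df`.
df_demo <- data.frame(
  year = c(1996, 1996, 1997),
  month = c("JAN", "FEB", "DEC"),
  station = c("A", "A", "B"),
  concentration = c(24.4, 37.0, 40.0),
  stringsAsFactors = FALSE
)

# Fixed lookup of month abbreviations; match() avoids locale-dependent parsing.
month_abbr <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# Build "YYYY-MM-01" strings and convert them to Date (assumption: day = 1).
df_demo$date <- as.Date(sprintf("%d-%02d-01",
                                as.integer(df_demo$year),
                                match(df_demo$month, month_abbr)))

print(df_demo)
```

With a `date` column in place, rows can be filtered with ordinary comparisons such as `date >= start & date <= end`.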

+0

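You could also convert them to strings of the form "1997-10". Then you can do something like 'mean(concentration[date >= start & date <= end])'. –

+0

'library(zoo); as.yearmon(participant$start)' etc. could also be very handy in this case, if you don't want to deal with the slightly clunky POSIXct format. – thelatemail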
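The string-key idea from these comments can be sketched in base R as below. It relies on the fact that zero-padded "YYYY-MM" strings sort in chronological order, so plain `>=` and `<=` work on them. The table `conc` and its values are hypothetical and exist only for this illustration; the station switch between years is deliberately left out.

```r
# Hypothetical monthly concentrations keyed by a "YYYY-MM" string.
conc <- data.frame(
  ym = c("1996-05", "1996-06", "1996-07", "1996-08"),
  concentration = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Convert a start/end Date pair to the same "YYYY-MM" form.
start_ym <- format(as.Date("1996-06-01"), "%Y-%m")
end_ym   <- format(as.Date("1996-07-01"), "%Y-%m")

# Zero-padded year-month strings compare in chronological order.
in_window <- conc$ym >= start_ym & conc$ym <= end_ym
window_mean <- mean(conc$concentration[in_window])
print(window_mean)  # mean of the June and July values: (20 + 30) / 2 = 25
```

`zoo::as.yearmon()` gives the same kind of month-level key with a proper class, which is what the answer below uses.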

Answer

1

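Here is a data.table solution. The basic idea is to expand the range from start to end into a sequence of months (as yearmon) for each id, and then use that as an index into the concentration table df. It is a bit complicated, so hopefully someone will come along and show you a simpler way.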

library(data.table) 
library(zoo)   # for as.yearmon(...) 
setDT(df)    # convert to data.table 
setDT(participant) 
df[, yrmon := as.yearmon(paste(year, month, sep="-"), format="%Y-%b")] # add year-month column (%b = abbreviated month name)
p.melt <- reshape(participant, varying=2:3, direction="long", sep="", timevar="year") # long format: one row per id and year
x <- participant[, .(date=seq(start,end,by="month")), by=id]      # one row per id and month
x[, c("year","yrmon") := .(year(date), as.yearmon(date))]        # add year and year-month
x[p.melt, station := i.station, on=c("id","year")]               # add station (i. = column taken from p.melt)
x[df, conc := i.concentration, on=c("yrmon","station"), nomatch=0] # add concentration (i. = column taken from df)
setorder(x,id) # not necessary, but makes it easier to interpret x 
result <- x[, .(mean.conc=mean(conc)), by=id]     # mean(conc) by id 
result 
#    id mean.conc
# 1:  1  28.61818
# 2:  2  28.56000
# 3:  3  28.44000
# 4:  4  29.60000
# (values will differ between runs, because runif() is called without a seed)

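So, first we convert everything to data.tables. Then we add a yrmon column to df for indexing later. Next we create p.melt by reshaping participant into long format, so that the station is in one column and the year indicator (1996 or 1997) is in a separate column. Then we create a temporary table x with a monthly sequence of dates for each id, and add year and yrmon for each date. Then we merge p.melt into x on id and year to get the station that applies in each month. Then we merge df into x on yrmon and station to get the matching concentration. Finally we aggregate conc in x by id using mean(...).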
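For comparison, the same steps can be sketched in base R without data.table or zoo. This is not the answer's method but a swapped-in lookup with `match()`. The data are deterministic stand-ins chosen here so the result can be checked by hand: station A is always 10 and station B always 20. The names `grid` and `mean_conc` are hypothetical, and the sketch assumes only the two years 1996 and 1997.

```r
month_abbr <- toupper(month.abb)  # "JAN", ..., "DEC"

# Deterministic stand-in for `df`: station A is always 10, station B always 20.
grid <- expand.grid(month = month_abbr, station = c("A", "B"),
                    year = c(1996, 1997), stringsAsFactors = FALSE)
grid$concentration <- ifelse(grid$station == "A", 10, 20)

# Two participants in the same layout as the question's `participant`.
participant <- data.frame(
  id = 1:2,
  station1996 = c("A", "B"),
  station1997 = c("B", "B"),
  start = as.Date(c("1996-06-01", "1996-08-01")),
  end = as.Date(c("1997-04-01", "1997-05-01")),
  stringsAsFactors = FALSE
)

# Mean concentration for one participant row `p`.
mean_conc <- function(p) {
  dates <- seq(p$start, p$end, by = "month")          # one date per month
  yr <- as.integer(format(dates, "%Y"))
  mon <- month_abbr[as.integer(format(dates, "%m"))]
  # Pick the station that applies in each month's year (assumes 1996/1997 only).
  st <- ifelse(yr == 1996, p$station1996, p$station1997)
  key <- paste(yr, mon, st)
  lookup <- paste(grid$year, grid$month, grid$station)
  mean(grid$concentration[match(key, lookup)])
}

result <- data.frame(
  id = participant$id,
  mean.conc = vapply(seq_len(nrow(participant)),
                     function(i) mean_conc(participant[i, ]),
                     numeric(1))
)
print(result)
```

For id 1 the window covers 7 months at station A and 4 months at station B, so the mean is (7 * 10 + 4 * 20) / 11; for id 2 every month is at station B, so the mean is 20.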
