2016-07-04 38 views

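I have two separate datasets. One contains the locations of participants; the other contains the locations of measurement stations and their corresponding values at different time points. I generate sample datasets below. How can I compute the distances and return the value of a specific variable from the station at the shortest distance?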

# dataset of value 
yearmon <- c("Jan 1996","Jan 1996","Jan 1996","Jan 1996","Jan 1996","Jan 1996", 
     "Feb 1996","Feb 1996","Feb 1996","Feb 1996","Feb 1996","Feb 1996", 
     "Mar 1996","Mar 1996","Mar 1996","Mar 1996","Mar 1996","Mar 1996", 
     "Apr 1996","Apr 1996","Apr 1996","Apr 1996","Apr 1996","Apr 1996", 
     "May 1996","May 1996","May 1996","May 1996","May 1996","May 1996", 
     "Jun 1996","Jun 1996","Jun 1996","Jun 1996","Jun 1996","Jun 1996") 

lon <- c(114.1592, 114.1294, 114.1144, 114.0228, 113.9763, 113.9431) 

lat <- c(22.35694, 22.31306, 22.33000, 22.37167, 22.37639, 22.45111) 

STN <- c("A","B","C","D","E","F") 

value <- runif(n=36, min=10, max=20) 

df<- data.frame(STN,lon,lat) 
df<- rbind(df,df,df,df,df,df) 
df <- cbind(df,yearmon,value) 
df$value[df$value < 12] <- NA 


# dataset of participant location 
id <- c(1,2,3,4) 
lon.p <- c(114.3608, 114.1850, 114.1581, 114.1683) 
lat.p <- c(22.44500, 22.33000, 22.28528, 22.37167) 
participant <- data.frame(id,lon.p,lat.p) 

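The sample datasets are shown below. At each time point (yearmon), I want to compute the distance between each station (A-F) and each participant (1-4), and then assign each participant the value from the nearest station at that time point. I cannot assign participants to a station up front, because station locations may change between time points (although they do not change in the sample data).

That is, if participant 1 is closest to station A in Jan 1996, he or she should be assigned the value 17.03357.

I would prefer great-circle distance, perhaps computed with a call like: rdist.earth(location1, location2, miles = FALSE, R = 6371)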

head(df,10) 
   STN      lon      lat  yearmon    value
1    A 114.1592 22.35694 Jan 1996 17.03357
2    B 114.1294 22.31306 Jan 1996       NA
3    C 114.1144 22.33000 Jan 1996 17.98293
4    D 114.0228 22.37167 Jan 1996 15.98854
5    E 113.9763 22.37639 Jan 1996 16.78647
6    F 113.9431 22.45111 Jan 1996 18.89551
7    A 114.1592 22.35694 Feb 1996       NA
8    B 114.1294 22.31306 Feb 1996     19.9
9    C 114.1144 22.33000 Feb 1996 17.88482
10   D 114.0228 22.37167 Feb 1996 13.80029

participant 
  id    lon.p    lat.p
1  1 114.3608 22.44500
2  2 114.1850 22.33000
3  3 114.1581 22.28528
4  4 114.1683 22.37167

Finally, I think this is what I want returned (but with the values filled in):

id lon.p  lat.p Apr 1996 Feb 1996 Jan 1996 Jun 1996 Mar 1996 May 1996 
1 1 114.3608 22.44500 
2 2 114.1850 22.33000 
3 3 114.1581 22.28528 
4 4 114.1683 22.37167 

Thanks.


You have 'participant$id = c(1,2,3,4)', but the final dataset has 'id' as 'A,B,C,D'. Why did it change? – akash87


That was a mistake. I just edited it. Thanks – cyrusjan

Answer


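Here is one way to do it in a few steps. Note that I created a naive_dist function purely as a placeholder for your distance metric. The function is taken from here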

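naive_dist <- function(long1, lat1, long2, lat2) { 
    R <- 6371 # Earth mean radius [km] 
    deg2rad <- pi / 180 # coordinates are in decimal degrees; sin()/cos() expect radians 
    long1 <- long1 * deg2rad; lat1 <- lat1 * deg2rad 
    long2 <- long2 * deg2rad; lat2 <- lat2 * deg2rad 
    # spherical law of cosines; clamp to [-1, 1] so rounding error cannot produce NaN 
    x <- sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1) 
    d <- acos(pmin(pmax(x, -1), 1)) * R 
    return(d) # Distance in km 
} 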

dist_by_id <- by(participant, participant$id, FUN = function(x) 
    #you would use your distance metric here 
    naive_dist(long1 = x$lon.p, long2 = df$lon, lat1 = x$lat.p, lat2 = df$lat) 
) 

#function to find the min for each yearmon, by id 
find_min <- function(id, data, by_data){ 
    data$dist_column = by_data[[id]] 
    by(data, data$yearmon, FUN = function(x) x[which.min(x$dist_column),]$value) 
} 
#initialize 
participant[,4:9] = 0 
#by() returns its groups in sorted (alphabetical) yearmon order, so label the columns in that same order 
names(participant)[4:9] = sort(as.character(unique(df$yearmon))) 
#use a for loop to fill in the values 
for(i in 1:4){ 
    participant[i,4:9] = stack(find_min(id = i, data = df, by_data = dist_by_id))[,1] 
} 

participant 

  id    lon.p    lat.p Apr 1996 Feb 1996 Jan 1996 Jun 1996 Mar 1996 May 1996
1  1 114.3608 22.44500 17.36620 18.88409 19.53951 19.35646 13.00518 18.45556
2  2 114.1850 22.33000 17.36620 18.88409 19.53951 19.35646 13.00518 18.45556
3  3 114.1581 22.28528 18.57447 13.85192 17.52038       NA 16.14562 18.06435
4  4 114.1683 22.37167 17.36620 18.88409 19.53951 19.35646 13.00518 18.45556

Obviously, these results may change once you change the distance metric.
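
As one illustration of swapping the metric, the sketch below is a haversine great-circle distance that takes its arguments in the same order as naive_dist. The name haversine_dist is made up for this example. It assumes coordinates in decimal degrees and a mean Earth radius of 6371 km, and it is not the rdist.earth function mentioned in the question.

```r
# Hypothetical drop-in replacement for naive_dist (name chosen for this sketch).
# Assumes longitudes/latitudes are in decimal degrees; returns kilometres.
haversine_dist <- function(long1, lat1, long2, lat2, R = 6371) {
    to_rad <- pi / 180
    dlat <- (lat2 - lat1) * to_rad
    dlon <- (long2 - long1) * to_rad
    a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
    # pmin() guards against `a` creeping just above 1 through rounding
    2 * R * asin(sqrt(pmin(1, a)))
}

# Participant 1 against stations A and B from the sample data
haversine_dist(long1 = 114.3608, lat1 = 22.44500,
               long2 = c(114.1592, 114.1294), lat2 = c(22.35694, 22.31306))
```

Because it is vectorised in the same way, it should be usable wherever naive_dist is called in this answer.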

Alternatively, here is an option using dplyr. I tend to prefer this solution because it is likely to perform better.

library(dplyr) 
#df and participant share no column names, so merge() returns every station row paired with every participant 
df2 <- merge(df, participant, all = TRUE) 
#calculate distance 
df2$distance <- naive_dist(long1 = df2$lon, lat1 = df2$lat, 
          long2 = df2$lon.p, lat2 = df2$lat.p) 


df3 <- df2 %>% 
    group_by(yearmon, id) %>% 
    filter(distance == min(distance)) %>% 
    select(id, yearmon, value) 

participant2 <- participant 
participant2[,4:9] <- 0 
names(participant2)[4:9] <- as.character(unique(df$yearmon)) 

for(i in 1:4){ 
    participant2[i,4:9] = c(subset(df3, id == i)$value) 
} 

participant2 

  id    lon.p    lat.p Jan 1996 Feb 1996 Mar 1996 Apr 1996 May 1996 Jun 1996
1  1 114.3608 22.44500 19.53951 18.88409 13.00518 17.36620 18.45556 19.35646
2  2 114.1850 22.33000 19.53951 18.88409 13.00518 17.36620 18.45556 19.35646
3  3 114.1581 22.28528 17.52038 13.85192 16.14562 18.57447 18.06435       NA
4  4 114.1683 22.37167 19.53951 18.88409 13.00518 17.36620 18.45556 19.35646
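
Both versions above fill the wide table by position, so they depend on the months and rows arriving in a particular order. As a rough sketch of a key-based alternative in base R, the code below picks the nearest station for each participant and month, then pivots with tapply so that every value lands in the cell named by its own id and yearmon. The toy data, the helper name nearest_value_wide, and the flat-plane flat_dist are all stand-ins made up for this example; a real great-circle metric would go in place of flat_dist.

```r
# Toy data (made up for this sketch): 2 stations, 2 months, 2 participants
stations <- data.frame(
    STN     = c("A", "B", "A", "B"),
    lon     = c(114.16, 114.02, 114.16, 114.02),
    lat     = c(22.36, 22.37, 22.36, 22.37),
    yearmon = c("Jan 1996", "Jan 1996", "Feb 1996", "Feb 1996"),
    value   = c(17, 15, 12, 19)
)
people <- data.frame(id    = c(1, 2),
                     lon.p = c(114.18, 114.00),
                     lat.p = c(22.33, 22.40))

# Placeholder metric (NOT a great-circle distance), same argument order as naive_dist
flat_dist <- function(long1, lat1, long2, lat2) {
    sqrt((long2 - long1)^2 + (lat2 - lat1)^2)
}

# Hypothetical helper: returns an id x yearmon matrix of nearest-station values
nearest_value_wide <- function(people, stations, dist_fun) {
    # by = NULL asks merge() for the Cartesian product of the two data frames
    pairs <- merge(people, stations, by = NULL)
    pairs$distance <- dist_fun(pairs$lon.p, pairs$lat.p, pairs$lon, pairs$lat)
    # per (id, yearmon) group, keep the row with the smallest distance
    groups <- split(pairs, list(pairs$id, pairs$yearmon), drop = TRUE)
    nearest <- do.call(rbind, lapply(groups, function(g) g[which.min(g$distance), ]))
    # pivot by key; each cell holds exactly one value, so identity() suffices
    tapply(nearest$value, list(id = nearest$id, yearmon = nearest$yearmon), identity)
}

wide <- nearest_value_wide(people, stations, flat_dist)
# attach the month columns to the participant table, matching rows by id
result <- cbind(people, wide[as.character(people$id), , drop = FALSE])
result
```

The month columns come out in sorted order here, but since they are matched by name rather than by position, reordering them afterwards is safe.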