Matching rows from one dataset against a reference dataset (R)

I have a question. Suppose I have two data frames:

values <- data.frame(x = rnorm(10000), y = rnorm(10000), matches = 0) 
reference <- data.frame(a = rnorm(10000), b = rnorm(10000))

For each row in 'values', I want to know how many rows of the 'reference' dataset match within a defined range.
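To make the matching rule concrete, here is a minimal sketch on hand-picked toy data. The numbers, the `_toy` data frames and the helper name `count_in_window` are illustrative assumptions, not part of the question. A reference row counts as a match when `a` lies within 0.1 of `x` and `b` lies within 0.2 of `y`, bounds included.

```r
# Toy data, chosen by hand for illustration only.
values_toy <- data.frame(x = c(0, 1), y = c(0, 1))
reference_toy <- data.frame(a = c(0.05, 0.30, 0.95, 1.05),
                            b = c(0.10, 0.00, 1.15, 0.90))

# Hypothetical helper: count reference rows inside the window around (x, y).
count_in_window <- function(x, y, ref, dx = 0.1, dy = 0.2) {
  sum(ref$a >= x - dx & ref$a <= x + dx &
      ref$b >= y - dy & ref$b <= y + dy)
}

# Apply the helper to every (x, y) pair of the toy values.
values_toy$matches <- mapply(count_in_window, values_toy$x, values_toy$y,
                             MoreArgs = list(ref = reference_toy))
print(values_toy)
```

Here the first row of `values_toy` matches only the first reference row, while the second row matches the last two, so `matches` comes out as 1 and 2.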

library(data.table)  # provides %between%

system.time(
  for (i in 1:nrow(values)) {
    # define the valid range around the current row
    x1 <- values$x[i] - 0.1
    x2 <- values$x[i] + 0.1
    y1 <- values$y[i] - 0.2
    y2 <- values$y[i] + 0.2

    # count the reference rows that fall inside that range
    values$matches[i] <- nrow(reference[reference$a %between% c(x1, x2) &
                                        reference$b %between% c(y1, y2), ])
  }
)


   user  system elapsed
   9.91    0.03    9.94

The example above works, but it takes too long for large datasets. Maybe this can be done with data.table?

Thanks in advance.

Asked 2016-04-20 by Beginner

It seems you are already using 'data.table', since '%between%' is not an operator in base R. You may want to add the 'data.table' tag to your question. – lmo

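What are the dimensions of your actual data? Will it always be the case that 'nrow(values) == nrow(reference)'? Are there always 2 columns, or might you need something like 'reference$a %between% c(x1, x2) & reference$b %between% c(y1, y2) & reference$c %between% c(z1, z2) & ...'? –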

Here is a data.table approach:

library(data.table)

# set up both datasets as data.tables
values <- setDT(data.frame(x = rnorm(10000), y = rnorm(10000), matches = 0))
reference <- setDT(data.frame(a = rnorm(10000), b = rnorm(10000)))

# initialize matches as an integer column for speed
values[, matches := integer(nrow(values))]

# for each row of values, count the reference rows inside the range
values[, matches := sum(reference$a %between% c(x - 0.1, x + 0.1) &
                        reference$b %between% c(y - 0.2, y + 0.2)),
       by = seq_len(nrow(values))]

This is probably faster than what you have, but there may well be a faster method still.
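One candidate for such a faster method is a data.table non-equi join, which lets data.table do the range lookup instead of scanning `reference` once per row. The following is a sketch only, not something proposed in this thread: the helper columns `x1`, `x2`, `y1`, `y2`, the smaller table size and the seed are assumptions made for illustration, and it has not been benchmarked against the code above.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
values <- data.table(x = rnorm(1000), y = rnorm(1000))
reference <- data.table(a = rnorm(1000), b = rnorm(1000))

# Precompute the window bounds as columns so they can be used in `on`.
values[, `:=`(x1 = x - 0.1, x2 = x + 0.1, y1 = y - 0.2, y2 = y + 0.2)]

# Non-equi join: for each row of `values`, count the matching rows of
# `reference`. With by = .EACHI, .N is 0 for rows that have no match.
counts <- reference[values,
                    on = .(a >= x1, a <= x2, b >= y1, b <= y2),
                    .N, by = .EACHI]
values[, matches := counts$N]

# Drop the helper columns again.
values[, c("x1", "x2", "y1", "y2") := NULL]
```

The `>=` and `<=` conditions are meant to mirror the inclusive bounds that `%between%` uses by default. The join returns one result row per row of `values`, in the original order, which is why `counts$N` can be assigned back directly.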

Answered 2016-04-20 11:56:15 by lmo

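Here is another solution, using dplyr's rowwise(). If the "defined range" is symmetric, performance can be improved by checking only two conditions instead of four: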

count_matches <- function(x, y) { 
    sum(abs(reference$a - x) <= 0.1 & abs(reference$b - y) <= 0.2) 
} 

library(dplyr) 
values %>% 
    rowwise() %>% 
    mutate(matches = count_matches(x, y))
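For readers without dplyr, the same helper can be applied with base R's `mapply()`. The sketch below also compares the two-condition `abs()` form with the four-bound form from the question. The smaller data size, the seed and the name `four_bounds` are assumptions for illustration; in principle the two forms could disagree for a point lying exactly on a boundary, because of floating-point rounding.

```r
set.seed(42)  # arbitrary seed for reproducibility
values <- data.frame(x = rnorm(200), y = rnorm(200))
reference <- data.frame(a = rnorm(200), b = rnorm(200))

# Same helper as in the answer: symmetric window, two comparisons.
count_matches <- function(x, y) {
  sum(abs(reference$a - x) <= 0.1 & abs(reference$b - y) <= 0.2)
}

# Base R alternative to rowwise(): apply the helper to each (x, y) pair
# and store the result in the data frame.
values$matches <- mapply(count_matches, values$x, values$y)

# Four-bound version from the question, for comparison.
four_bounds <- mapply(function(x, y) {
  sum(reference$a >= x - 0.1 & reference$a <= x + 0.1 &
      reference$b >= y - 0.2 & reference$b <= y + 0.2)
}, values$x, values$y)

# TRUE when both formulations give the same count for every row.
all(values$matches == four_bounds)
```

Note that the dplyr pipeline returns a new data frame, so its result has to be assigned (for example back to `values`) to keep the `matches` column.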

Answered 2016-04-20 12:57:12 by MarkusN
