Numeric comparison

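Data frame d1:

Data frame d2:

How can I merge d1 and d2 by "x", where each d1$x should be matched against either the exact value or the next higher value in d2$x? The output should look like this: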

x y z 
4 10 200 # (4 is matched against next higher value that is 6) 
6 20 200 # (6 is matched against 6) 
7 30 300 # (7 is matched against next higher value that is 9)

If merge() cannot do this, is there another way to do it? A for loop is painfully slow.

Source

2014-06-07 user3718192

Input data:

d1 <- data.frame(x=c(4,6,7), y=c(10,20,30)) 
d2 <- data.frame(x=c(3,6,9), z=c(100,200,300))

Basically, you want to extend d1 with a new column. So let's make a copy of it.

d3 <- d1

Next, I assume that d2$x is sorted in nondecreasing order and that max(d1$x) <= max(d2$x).

d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])

This reads: for each x in d1$x, find the smallest value in d2$x that is not smaller than x, and take the corresponding d2$z.

Under these assumptions, the above can also be written as follows (and should be a bit faster):

d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)])
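
The two variants only agree while the assumption max(d1$x) <= max(d2$x) holds. A minimal sketch of what happens when it does not, using a made-up lookup value of 10 (not part of the original data):

```r
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))
x_too_big <- 10  # hypothetical value, larger than max(d2$x)

# which() finds no match, so [1] yields NA and the lookup returns NA
z_which <- d2$z[which(x_too_big <= d2$x)[1]]

# which.max() on an all-FALSE vector returns 1, so the first row is
# picked silently even though it is not a valid match
z_which_max <- d2$z[which.max(x_too_big <= d2$x)]

print(z_which)      # NA
print(z_which_max)  # 100
```

So the which() form fails visibly with NA, while the which.max() form would need an explicit range check if out-of-range values can occur.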

As a result we get:

d3 
## x y z 
## 1 4 10 200 
## 2 6 20 200 
## 3 7 30 300

EDIT1: Inspired by @MatthewLundberg's cut-based solution, here is another one using findInterval:

d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1]
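
Note that shifting the breaks by 1 relies on the x values being whole numbers. A sketch of an alternative for non-integer data, assuming an R version whose findInterval() has the left.open argument (R 3.3.0 or later, as far as I know); the value 6.5 is made up for illustration:

```r
d1 <- data.frame(x = c(4, 6, 6.5), y = c(10, 20, 30))  # 6.5 is a made-up value
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))

# Integer trick from above: 6.5 falls below the shifted break 7,
# so it is wrongly matched to d2$x == 6
z_shifted <- d2$z[findInterval(d1$x, d2$x + 1) + 1]

# left.open = TRUE uses intervals of the form (a, b], so an exact match
# stays in the lower interval and anything above it moves to the next one
z_left_open <- d2$z[findInterval(d1$x, d2$x, left.open = TRUE) + 1]

print(z_shifted)    # 200 200 200
print(z_left_open)  # 200 200 300
```

As with the other variants, values above max(d2$x) index past the end of d2$z and come back as NA.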

EDIT2: (benchmark)

Example data:

set.seed(123) 
d1 <- data.frame(x=sort(sample(1:10000, 1000)), y=sort(sample(1:10000, 1000))) 
d2 <- data.frame(x=sort(c(sample(1:10000, 999), 10000)), z=sort(sample(1:10000, 1000)))

Results:

library(data.table) ## needed for setDT() and setkey() in the last expression 
microbenchmark::microbenchmark(
findInterval = {d3 <- d1; d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1] }, 
which = {d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]]) }, 
which.max = {d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)]) }, 
cut = {d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]; merge(d1, d2, by.x='x2', by.y='x')}, 
data.table = {d1a <- d1; setkey(setDT(d1a), x); d2a <- d2; setkey(setDT(d2a), x); d2a[d1a, roll=-Inf] } 
) 
## Unit: microseconds
##          expr       min        lq    median        uq       max neval
##  findInterval   221.102  1357.558  1394.246  1429.767  17810.55   100
##         which 66311.738 70619.518 85170.175 87674.762 220613.09   100
##     which.max 69832.069 73225.755 83347.842 89549.326 118266.20   100
##           cut  8095.411  8347.841  8498.486  8798.226  25531.58   100
##    data.table  1668.998  1774.442  1878.028  1954.583  17974.10   100
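
Timings are only comparable if the approaches return the same result. A small sanity check on the same example data, sketched here as an addition to the benchmark rather than part of it (the names z_find and z_which are made up):

```r
set.seed(123)
d1 <- data.frame(x = sort(sample(1:10000, 1000)), y = sort(sample(1:10000, 1000)))
d2 <- data.frame(x = sort(c(sample(1:10000, 999), 10000)), z = sort(sample(1:10000, 1000)))

# Vectorised lookup via findInterval (valid here because x is integer-valued)
z_find <- d2$z[findInterval(d1$x, d2$x + 1) + 1]

# Element-wise lookup via which(), as in the first solution
z_which <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])

# Both should give the same z for every row, with no missing values
print(all(z_find == z_which))
print(anyNA(z_find))
```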

Source

2014-06-07 16:59:14 gagolews

This is fairly straightforward using a rolling join with data.table:

require(data.table) ## >= 1.9.2 
setkey(setDT(d1), x) ## convert to data.table, set key for the column to join on 
setkey(setDT(d2), x) ## same as above 

d2[d1, roll=-Inf] 

# x z y 
# 1: 4 200 10 
# 2: 6 200 20 
# 3: 7 300 30
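
To illustrate the direction of the roll, here is a small sketch contrasting roll=-Inf with roll=TRUE on the same data. It assumes the data.table package is installed and uses the on= argument instead of setkey(), which is a substitution on my part rather than the form used above:

```r
library(data.table)

d1 <- data.table(x = c(4, 6, 7), y = c(10, 20, 30))
d2 <- data.table(x = c(3, 6, 9), z = c(100, 200, 300))

# roll = -Inf: when there is no exact match, take the next higher d2$x
# (next observation carried backward)
next_higher <- d2[d1, on = "x", roll = -Inf]

# roll = TRUE: when there is no exact match, take the previous lower d2$x
# (last observation carried forward)
last_lower <- d2[d1, on = "x", roll = TRUE]

print(next_higher$z)  # 200 200 300
print(last_lower$z)   # 100 200 200
```

Only the roll=-Inf variant matches the output requested in the question.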

Source

2014-06-07 17:02:01 Arun

cut can be used to find the appropriate match in d2$x for each value in d1$x.

The matching indices are computed with cut as follows:

as.numeric(cut(d1$x, c(-Inf, d2$x, Inf))) 
## [1] 2 2 3

These are the matched values:

d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))] 
## [1] 6 6 9

These can be added to d1, and then the merge can be performed:

d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))] 
merge(d1, d2, by.x='x2', by.y='x') 
## x2 x y z 
## 1 6 4 10 200 
## 2 6 6 20 200 
## 3 9 7 30 300

If desired, the added column can be removed afterwards.
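
If only the z column is needed, the helper column and the merge() call can be skipped by using the cut indices for a direct lookup. This is a variation on the approach above, not part of the original answer, and it assumes the values in d2$x are unique and sorted (cut requires unique breaks):

```r
d1 <- data.frame(x = c(4, 6, 7), y = c(10, 20, 30))
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))

# Interval index of each d1$x among the right-closed intervals
# (-Inf, 3], (3, 6], (6, 9], (9, Inf]
idx <- as.integer(cut(d1$x, c(-Inf, d2$x, Inf)))

# Use the index to pick z directly; no extra column, no merge
d3 <- d1
d3$z <- d2$z[idx]

print(d3)
```

Values above max(d2$x) fall into the last interval, whose index is past the end of d2$z, so they come back as NA.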

Source

2014-06-07 18:50:46

+1 for 'cut'. Also, I think 'findInterval' would serve a similar purpose. – gagolews

@gagolews 'findInterval' uses left-closed intervals. 'cut' gives you a choice (right-closed is the default). –

'rightmost.closed'？ – gagolews
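
The difference discussed in these comments shows up at the break points themselves. A short sketch with the breaks from the question (variable names are made up for illustration):

```r
breaks <- c(3, 6, 9)

# findInterval uses left-closed intervals [3, 6), [6, 9): 6 lands in the second
fi_default <- findInterval(6, breaks)

# cut defaults to right-closed intervals (3, 6], (6, 9]: 6 lands in the first
cut_default <- as.integer(cut(6, breaks))

# cut with right = FALSE behaves like findInterval at this point
cut_left <- as.integer(cut(6, breaks, right = FALSE))

# rightmost.closed only changes how the last break is treated:
# 9 is counted as part of the last interval instead of lying beyond it
fi_last <- findInterval(9, breaks)
fi_last_closed <- findInterval(9, breaks, rightmost.closed = TRUE)

print(c(fi_default, cut_default, cut_left))  # 2 1 2
print(c(fi_last, fi_last_closed))            # 3 2
```

So rightmost.closed does not turn findInterval into a right-closed lookup in general; it affects only the final interval.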

Please try: sapply(d1$x, function(y) d2$z[d2$x >= y][which.min(abs(y - d2$x[d2$x >= y]))])

Source

2014-06-08 01:59:38 userNaN
