Subset a data frame based on column entry (or rank)

I have a simple data.frame like this one:

id group idu value 
1 1  1_1 34 
2 1  2_1 23 
3 1  3_1 67 
4 2  4_2 6 
5 2  5_2 24 
6 2  6_2 45 
1 3  1_3 34 
2 3  2_3 67 
3 3  3_3 76

From it I would like to retrieve a subset containing the first entry of each group; for example:

id group idu value 
1 1  1_1 34 
4 2  4_2 6 
1 3  1_3 34

id is not unique, so the approach should not rely on it.

Can I achieve this without using loops?

structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L, 
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3", 
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L, 
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group", 
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))


2011-04-27 Paulo E. Cardoso

Using Gavin's million-row data frame:

DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE), 
        group = factor(rep(1:1000, each = 1000)), 
        value = runif(1000000)) 
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

I think the fastest way is to reorder the data frame and then use duplicated():

system.time({ 
    DF4 <- DF3[order(DF3$group), ] 
    out2 <- DF4[!duplicated(DF4$group), ] 
}) 
# user system elapsed 
# 0.335 0.107 0.441

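In contrast, Gavin's fastest lapply + split method took 7 seconds on my computer.

In general, when working with data frames, the fastest approach is usually to generate all the indices first and then do a single subset.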
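As a small illustration of "generate the indices, then subset once", here is a minimal sketch on data shaped like the question's example. The names first_rows and first_per_group are made up for this example, and match(unique(...)) is just one way to build the index; it is not the code timed above.

```r
# Small data frame shaped like the one in the question.
DF <- data.frame(id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
                 group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
                 value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L))

# Step 1: compute the row index of the first occurrence of each group.
first_rows <- match(unique(DF$group), DF$group)

# Step 2: subset the data frame once, using all indices together.
first_per_group <- DF[first_rows, ]
print(first_per_group)
```

The point of the design is that the data frame itself is touched only once; all the per-group work happens on a plain vector.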


2011-04-28 14:37:25 hadley

+1 This is a great example. – 2011-04-28 21:30:39

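This is a nice approach, but as a side note: real data may also repeat group codes, which requires an extra step, namely adding a truly unique group ID to the whole dataset, perhaps based on a timestamp column. – 2011-04-29 08:04:38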
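One possible way to handle repeated group codes is sketched below. It assumes the rows are already in their natural (for example, time) order, so that a new run starts wherever the group code changes; the data and the column name run_id are hypothetical and only serve this illustration.

```r
# Hypothetical data: group code 1 appears in two separate runs.
DF <- data.frame(group = c(1L, 1L, 2L, 2L, 1L, 1L),
                 value = c(10, 11, 20, 21, 30, 31))

# Assumption: rows are already sorted in their natural order.
# A new run starts wherever the group code differs from the previous row.
new_run <- c(TRUE, DF$group[-1] != DF$group[-nrow(DF)])

# Cumulative count of run starts gives a unique ID per run.
DF$run_id <- cumsum(new_run)

# First row of each run, using the unique run ID instead of the group code.
first_per_run <- DF[!duplicated(DF$run_id), ]
print(first_per_run)
```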

How come !duplicated() returns the first value of each duplicated group? – zach 2011-11-09 23:28:42
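A short sketch of why this works: duplicated() marks an element TRUE only if an equal element has already appeared earlier in the vector, so the first occurrence of every value is FALSE, and negating the result selects exactly those first occurrences. The vector g below is made up for illustration.

```r
g <- c(1, 1, 1, 2, 2, 3)

# TRUE only where the value has been seen earlier in the vector.
dup <- duplicated(g)
print(dup)

# Negating keeps the first occurrence of each distinct value.
keep <- which(!dup)
print(keep)
```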

I think this will do the trick, with the dput() data from the question stored in data:

aggregate(data["idu"], data["group"], function (x) x[1])

For your updated question, I would suggest using ddply() from the plyr package:

library(plyr)  # provides ddply() and .()
ddply(data, .(group), function (x) x[1,])


2011-04-27 13:58:05

That works too. Thanks, Daniel. – 2011-04-27 14:08:46

See the answer to the updated question. – 2011-04-27 15:27:45

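Update in light of the OP's comments

If you are doing this on a million-plus rows, all of the options provided so far will be slow. Here are some comparison timings on a 100,000-row dummy dataset: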

set.seed(12) 
DF3 <- data.frame(id = sample(1000, 100000, replace = TRUE), 
        group = factor(rep(1:100, each = 1000)), 
        value = runif(100000)) 
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_"))) 

> system.time(out1 <- do.call(rbind, lapply(split(DF3, DF3["group"]), `[`, 1,))) 
    user system elapsed 
19.594 0.053 19.984 
> system.time(out3 <- aggregate(DF3[,-2], DF3["group"], function (x) x[1])) 
    user system elapsed 
12.419 0.141 12.788

I gave up on running these on the million rows. Far faster, believe it or not, is:

out2 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1,)), 
       byrow = TRUE, nrow = (lev <- length(levels(DF3$group)))) 
colnames(out2) <- names(DF3)[-4] 
rownames(out2) <- seq_len(lev) 
out2 <- as.data.frame(out2) 
out2$group <- factor(out2$group) 
out2$idu <- factor(paste(out2$id, out2$group, sep = "_"), 
        levels = levels(DF3$idu))

The outputs are (effectively) the same:

> all.equal(out1, out2) 
[1] TRUE 
> all.equal(out1, out3[, c(2,1,3,4)]) 
[1] "Attributes: < Component 2: Modes: character, numeric >"    
[2] "Attributes: < Component 2: target is character, current is numeric >"

(The only difference between out1 (or out2) and out3 (the aggregate() version) is in the rownames component.)

with a timing of:

user system elapsed 
    0.163 0.001 0.168

on the 100,000-row problem, and on this million-row problem:

set.seed(12) 
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE), 
        group = factor(rep(1:1000, each = 1000)), 
        value = runif(1000000)) 
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

the timing is:

user system elapsed 
11.916 0.000 11.925

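The matrix version (the one that produces out2) handles the million rows faster than the other versions handle the 100,000-row problem. This just shows that working with matrices is very quick indeed, and that the bottleneck in my do.call() version is rbind()-ing the result together.

The million-row problem was timed with: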

system.time({out4 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), 
              `[`, 1,)), 
          byrow = TRUE, 
          nrow = (lev <- length(levels(DF3$group)))) 
      colnames(out4) <- names(DF3)[-4] 
      rownames(out4) <- seq_len(lev) 
      out4 <- as.data.frame(out4) 
      out4$group <- factor(out4$group) 
      out4$idu <- factor(paste(out4$id, out4$group, sep = "_"), 
           levels = levels(DF3$idu))})
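Since the rbind() step is identified above as the bottleneck, another way to sidestep it is to split the row indices rather than the data frame itself and then subset once. This is only a sketch on a small data frame shaped like the question's example, not one of the versions timed above; idx_by_group and first_idx are names chosen for this illustration.

```r
DF <- data.frame(id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
                 group = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3)),
                 value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L))

# Split the row numbers (a cheap integer vector) by group,
# instead of splitting the whole data frame.
idx_by_group <- split(seq_len(nrow(DF)), DF$group)

# Take the first row number from each group.
first_idx <- vapply(idx_by_group, function(i) i[1L], integer(1))

# A single subset replaces the repeated rbind() calls.
out <- DF[first_idx, ]
print(out)
```

Unlike the matrix version, this keeps every column's original type, so no factors need to be rebuilt afterwards.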

Original

If your data is in DF, say, then:

do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))

will do what you want:

> do.call(rbind, lapply(with(DF, split(DF, group)), head, 1)) 
    idu group 
1 1  1 
2 4  2 
3 7  3

If the new data is in DF2, then we get:

> do.call(rbind, lapply(with(DF2, split(DF2, group)), head, 1)) 
    id group idu value 
1 1  1 1_1 34 
2 4  2 4_2  6 
3 1  3 1_3 34

But for speed, we probably want to subset rather than use head(), and we can gain a little by not using with(), e.g.:

do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1,)) 

> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1,)))) 
    user system elapsed 
    3.847 0.040 4.044 
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), head, 1)))) 
    user system elapsed 
    4.058 0.038 4.111 
> system.time(replicate(1000, aggregate(DF2[,-2], DF2["group"], function (x) x[1]))) 
    user system elapsed 
    3.902 0.042 4.106


2011-04-27 14:00:20

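Seems to work, Gavin. I edited the content of the question, but that probably doesn't affect it. I will have to test its performance on a 2-million-row data frame. – 2011-04-27 14:05:14

@Paulo I have updated the answer with some comparison timings from repeated runs on this dataset. – 2011-04-27 14:54:13

@Paulo Cardosa I did some timings on a big problem and all the options were slow, so I have provided a version that works with matrices, which is much faster. Timings for the million-row problem are included. – 2011-04-27 16:12:14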

A solution using plyr, assuming your data is in an object named zzz:

ddply(zzz, "group", function(x) x[1 ,])

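Another option is to take the difference between consecutive rows, which should prove faster, but it relies on the object being ordered beforehand. It also assumes that you don't have a group value of 0: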

zzz <- zzz[order(zzz$group) ,] 

zzz[ diff(c(0,zzz$group)) != 0, ]
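To make the diff() approach concrete, here is a minimal sketch on a small, deliberately unordered data frame. The data are made up, and group is assumed to be an integer column with no 0 values, as the answer requires.

```r
# Made-up data, deliberately not sorted by group.
zzz <- data.frame(group = c(3L, 1L, 2L, 1L, 3L, 2L),
                  value = c(30, 10, 20, 11, 31, 21))

# Order by group first; ties keep their original relative order.
zzz <- zzz[order(zzz$group), ]

# Prepending 0 makes the very first row register as a change,
# which is why a real group value of 0 would break this.
firsts <- zzz[diff(c(0, zzz$group)) != 0, ]
print(firsts)
```

After sorting, a non-zero difference marks the row where a new group begins, so the logical index picks out exactly one row per group.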


2011-04-27 14:07:25 Chase
