What is an efficient way to convert sets into a column index in R?

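Given a large data frame A (more than 5,000,000 rows) with string row names, and a list B of disjoint sets (N = 20,000) in which each set consists of row names from A, what is the best way to create a vector that represents the sets in B, with one unique value per set?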

Illustration

Here is an example that illustrates the problem:

# Input 
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6))) 
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+

The desired result would be:

# An index of NA represents that the row is not part of any set in B. 
> A[,"index", drop = F] 
     d index 
4655297 A  1 
3328423 A  1 
2911946 A  2 
2829484 A  2 
3871770 A NA 
2702914 A NA 
2581677 A NA 
4106410 A NA 
3755846 A NA 
3177816 A  1

Naive attempt

Something like this can be achieved with the following approach.

n <- 0 
A$index <- NA 
lapply(B, function(x){ 
    n <<- n + 1 
    A[x, "index"] <<- n 
})

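Question

However, this is unreasonably slow (several hours) because of the repeated indexing, and it is neither very R-like nor elegant.

How can the desired result be generated quickly and efficiently?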

Source

2012-10-23 Nixuz

Here is a suggestion using base R that is not too bad compared with your current method.

Sample data:

A <- data.frame(d = rep("A", 5e6), 
       set = sample(c(NA, 1:20000), 5e6, replace = TRUE), 
       row.names = as.character(sample(1:5e6))) 
B <- split(rownames(A), A$set)

Base approach:

system.time({ 
A$index <- NA 
A[unlist(B), "index"] <- rep(seq_along(B), times = lapply(B, length)) 
}) 
# user system elapsed 
# 15.30 0.19 15.50

Check:

identical(A$set, A$index) 
# TRUE
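
The flatten-and-repeat idea can be checked by hand on a toy example. The sketch below is a variant rather than the exact code above: it looks the row names up with match() instead of indexing the data frame by name, and it uses lengths(), which assumes R 3.2.0 or later. The row names r1 to r6 are made up for illustration.

```r
# Toy data: six rows with character row names and two disjoint sets.
A <- data.frame(d = rep("A", 6),
                row.names = c("r5", "r2", "r6", "r1", "r4", "r3"))
B <- list(c("r1", "r2"), c("r3", "r4", "r5"))

# Flatten the sets and build a parallel vector holding each member's set number.
members <- unlist(B)
set_id  <- rep(seq_along(B), times = lengths(B))

# match() gives, for every row name, its position in `members` (NA if absent),
# so one vectorised lookup fills the whole column.
A$index <- set_id[match(rownames(A), members)]
print(A)
```

Row r6 belongs to no set, so its index stays NA, which is the convention used in the question.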

For anything faster, I imagine data.table will come in handy.
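
One way the data.table route might look is an update join, sketched below on toy data. This is an assumption about how it could be done, not code from the thread and not benchmarked here. It assumes the data.table package is installed (version 1.9.6 or later for the on= argument), and it keeps the row names in an ordinary column called name, a made-up column name, because data.table does not use row names.

```r
# Runs only if data.table is available; otherwise the block is skipped.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  B <- list(c("r1", "r2"), c("r3", "r4", "r5"))

  # The former row names live in the `name` column (hypothetical name).
  DT <- data.table(name = c("r5", "r2", "r6", "r1", "r4", "r3"), d = "A")

  # One row per set member, paired with the number of the set it belongs to.
  lookup <- data.table(name  = unlist(B),
                       index = rep(seq_along(B), times = lengths(B)))

  # Update join: add `index` to DT by reference for rows whose name matches;
  # rows without a match are left as NA.
  DT[lookup, on = "name", index := i.index]
  print(DT)
}
```

Because := modifies DT in place, the table is not copied for the assignment, which is the usual argument for trying data.table on data of this size.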

Source

2012-10-23 19:37:27 flodel

Thanks. Elegant and fast! – Nixuz
