Efficiently convert a list of x/y coordinates to a data frame in R

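I have a list of length 30,000 whose elements are data frames with x and y columns. The data frames are sparse, so not every value of x is present. All x values lie between 1 and 200.

I want to convert this list into a single data frame with one column for each possible x value, where each row holds all the y values of one list entry (if an x value is absent, the entry should be 0). I have a working solution (see below), but it is very, very slow, and I think there must be a faster (and perhaps more elegant) way to do this.

My current solution (which is slow) is: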

dat <- matrix(numeric(0), 30000, 200)
for (i in seq(along=whaledatas)) {
    for (j in row.names(whaledatas[[i]])) {
        dat[i, whaledatas[[i]][j, "x"]] <- whaledatas[[i]][j, "y"]
    }
}

dfData <- data.frame(dat, files$label) 
dfData[is.na(dfData)] <- 0

Source

2013-03-11 leo

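If I'm reading this correctly, you can use the idiom 'do.call(rbind, whaledatas)' to convert a 'list' of 'data.frames' into a single 'data.frame'. – Justin 2013-03-11 20:45:31

When you say the values are between 1 and 200, are they integer values only? – mnel 2013-03-11 22:36:32

Here is an answer that runs in a reasonable amount of time: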

# function to create dummy data 
my_sampler <- function(idx) { 
    x <- sample(200, sample(50:100, 1)) 
    y <- sample(length(x)) 
    data.frame(x,y) 
} 

# create list of 30000 data.frames 
in.d <- lapply(1:30000, function(x) my_sampler(x))

Solution: using data.table

require(data.table) 
system.time(out.d <- do.call(rbind, lapply(in.d, function(x) { 
    setattr(x, 'class', c("data.table", "data.frame")) # mnel's suggestion 
    setkey(x, "x") 
    x[J(1:200)]$y 
}))) 

# user system elapsed 
# 47.111 0.343 51.283 

> dim(out.d) 
# [1] 30000 200 

# final step: replace NA with 0 
out.d[is.na(out.d)] <- 0
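
For readers unfamiliar with keyed joins: the 'x[J(1:200)]$y' step looks up each of 1..200 in x and returns NA where that x is absent. A rough base-R sketch of the same lookup, using match() on a made-up toy data frame (the names df, xmax and row_y are only for illustration, not from the answer), might look like this:

```r
# Hypothetical toy input: one sparse data frame with x in 1..10
df <- data.frame(x = c(2L, 5L, 9L), y = c(20, 50, 90))
xmax <- 10  # assumed upper bound on x (200 in the question)

# match() gives, for each of 1..xmax, the row of df holding that x (NA if absent)
row_y <- df$y[match(seq_len(xmax), df$x)]

# Absent x values come back as NA, so they still need replacing with 0
row_y[is.na(row_y)] <- 0
print(row_y)
```

This only illustrates what one row of the result contains; it says nothing about relative speed.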

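Edit: As @regetz shows, allocating the final matrix up front and then replacing the selected entries with the y values wherever x occurs is clever! A slight variation of @regetz's solution: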

m <- matrix(0.0, nrow=30000, ncol=200) 
system.time(for(i in 1:nrow(m)) { 
    m[i, in.d[[i]][["x"]]] <- in.d[[i]][["y"]] 
}) 

# user system elapsed 
# 1.496 0.003 1.511
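
To see what the row-wise assignment does, here is a minimal sketch on toy data (the list toy and its values are invented for illustration). It assumes, as in the question, that x holds positive integers no larger than the number of columns:

```r
# Toy data (hypothetical): two sparse data frames with x in 1..6
toy <- list(
    data.frame(x = c(1L, 4L), y = c(10, 40)),
    data.frame(x = c(2L, 3L, 6L), y = c(20, 30, 60))
)

m <- matrix(0, nrow = length(toy), ncol = 6)
for (i in seq_along(toy)) {
    # x selects the columns of row i; y supplies the values, in the same order
    m[i, toy[[i]][["x"]]] <- toy[[i]][["y"]]
}
print(m)
```

Because the matrix starts out as all zeros, columns whose x never occurs simply keep their 0.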

This seems to be even faster than @regetz's solution (shown below):

> system.time(dat <- datify(in.d, xmax=200)) 
# user system elapsed 
# 2.966 0.015 2.993

Source

2013-03-11 21:37:58 Arun

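@mnel, thanks for 'setattr'. I've edited the code, although I can't see a performance difference (51 seconds). – Arun 2013-03-11 22:34:01

I used 'do.call(rbind, ..)' because I am returning a vector. 'rbindlist' requires a data.frame/data.table/list. I did it this way because I wanted to obtain the 30000 * 200 matrix directly. With 'rbindlist' I end up with a 'data.table' (two columns, x and y) from which I would have to re-create the matrix. There is no performance gain; creating the data.table alone takes 51 seconds. – Arun 2013-03-11 22:38:54

'setattr' avoids a copy, which is a good thing, and it is instantaneous, so it will scale to larger data. I'm not convinced I really understand the question; perhaps 'rbindlist' followed by a reshape to wide is what the OP is after. – mnel 2013-03-11 22:43:15

I would use a data.table solution, something like this: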

whaledatas <- lapply(1:30000,function(x)data.frame(x=1:200,y=1:200)) 
library(data.table) 
dtt <- rbindlist(whaledatas)

Source

2013-03-11 21:05:12 agstudy

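And what about the missing values? He expects a 30000 * 200 data frame/matrix. – Arun 2013-03-11 21:08:06

@Arun I don't know how 'rbindlist' handles NA values. I chose it because it is faster than 'do.call'. – agstudy 2013-03-11 21:15:30

First, here is a small example list of data frames: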
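
One possible way to get from the 'rbindlist' output to the 30000 * 200 matrix is sketched below on toy data. It assumes a data.table version whose 'rbindlist' supports the 'idcol' argument, and it fills the matrix by (row, column) pairs rather than reshaping to wide; the names long, dat and xmax are illustrative only:

```r
library(data.table)

# Toy data (hypothetical): two sparse data frames with x in 1..10
whaledatas <- list(
    data.frame(x = 1:3, y = 11:13),
    data.frame(x = 6:10, y = 16:20)
)
xmax <- 10  # assumed upper bound on x (200 in the question)

# idcol adds a column recording which list element each row came from
long <- rbindlist(whaledatas, idcol = "id")

# Fill a zero matrix by (row, column) pairs; absent x values stay 0
dat <- matrix(0, length(whaledatas), xmax)
dat[cbind(long$id, long$x)] <- long$y
print(dat)
```

With an unnamed input list the id column is an integer index, so it can be used directly as the row number. No timing is claimed for this variant.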

# create some sample data 
whaledatas <- list(
    data.frame(x=1:3, y=11:13), 
    data.frame(x=6:10, y=16:20) 
)

I think this does the same thing as the loop in the original question?

# combine into single data frame 
whaledatas.all <- do.call("rbind", whaledatas) 

# change this to 200! kept small here for illustration... 
XMAX <- 10 

# create output matrix 
dat <- matrix(0.0, length(whaledatas), XMAX) 

# create index vector for dat rows 
i <- rep(1:length(whaledatas), sapply(whaledatas, nrow)) 

# populate dat 
dat[cbind(i, whaledatas.all[["x"]])] <- whaledatas.all[["y"]]
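
The last line relies on R's matrix indexing: when a matrix is indexed by a two-column matrix, each row of the index is read as one (row, column) pair. A tiny sketch with made-up values:

```r
m <- matrix(0, nrow = 2, ncol = 4)

# Each row of idx is one (row, column) pair
idx <- cbind(c(1, 1, 2), c(2, 4, 3))

# Assigns m[1, 2] <- 5, m[1, 4] <- 6, m[2, 3] <- 7 in a single step
m[idx] <- c(5, 6, 7)
print(m)
```

This is why a single assignment can populate every cell at once, given one index vector for the rows and one for the columns.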

Edit

The rbind gets painfully slow as the number of inputs increases. This version (wrapped in a convenience function) avoids it and runs much faster:

datify <- function(x, xmax=200) {
    dat <- matrix(0.0, length(x), xmax)
    for (i in seq_along(x)) {
        this.df <- x[[i]]
        coords <- cbind(rep(i, nrow(this.df)), this.df[["x"]])
        dat[coords] <- this.df[["y"]]
    }
    dat
}

Note that dat starts out as all zeros, so there is no need to fix it up after the fact...

> datify(whaledatas, xmax=10)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   11   12   13    0    0    0    0    0    0     0
[2,]    0    0    0    0    0   16   17   18   19    20

Timing on a 30k-long list of sample data frames, generated using Arun's my_sampler function:

set.seed(99) 
in.d <- lapply(1:30000, function(x) my_sampler(x)) 
system.time(dat <- datify(in.d, xmax=200)) 
## user system elapsed 
## 1.317 0.011 1.328
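
As a sanity check, the coordinate-matrix approach and the row-wise assignment from Arun's edit should produce the same matrix. A small sketch that compares them on a reduced input follows; the helper sample_df, the seed and the size of 100 are arbitrary choices for illustration, not taken from the thread:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Smaller hypothetical input than in the thread: 100 data frames, x in 1..200
sample_df <- function() {
    x <- sample(200, sample(50:100, 1))
    data.frame(x = x, y = sample(length(x)))
}
in.d <- lapply(1:100, function(i) sample_df())

# Approach 1: one (row, column) index matrix per data frame, as in datify
by_coords <- matrix(0, length(in.d), 200)
for (i in seq_along(in.d)) {
    d <- in.d[[i]]
    by_coords[cbind(rep(i, nrow(d)), d[["x"]])] <- d[["y"]]
}

# Approach 2: assign into row i directly, as in Arun's variation
by_row <- matrix(0, length(in.d), 200)
for (i in seq_along(in.d)) {
    by_row[i, in.d[[i]][["x"]]] <- in.d[[i]][["y"]]
}

print(identical(by_coords, by_row))
```

Both loops touch exactly the same cells, so any difference between them is in speed, not in the result.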

Source

2013-03-11 21:15:07 regetz

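FWIW, using Arun's very useful my_sampler function, this 'datify' function runs in <2 seconds on a 30k-long list of sample data frames. – regetz 2013-03-11 22:37:07

Could you run 'system.time(.)' on your code with 30000 data frames? If you like, you can create such a list using the function in my code. – Arun 2013-03-11 22:42:13

@Arun: Thanks for the suggestion (and the function). I've added timings to my answer. – regetz 2013-03-11 22:51:05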
