2011-04-13 125 views
6

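Split a data frame into overlapping data frames

I want to write a function that behaves as shown below, but it is proving to be very difficult: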

DF <- data.frame(x = seq(1,10), y = rep(c('a','b','c','d','e'),2)) 
> DF 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 
6 6 a 
7 7 b 
8 8 c 
9 9 d 
10 10 e 

>OverLapSplit(DF,nsplits=2,overlap=2) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 
6 6 a 

[[2]] 
    x y 
1 5 e
2 6 a
3 7 b
4 8 c
5 9 d
6 10 e

>OverLapSplit(DF,nsplits=1) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 
6 6 a 
7 7 b 
8 8 c 
9 9 d 
10 10 e 

>OverLapSplit(DF,nsplits=2,overlap=4) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 
6 6 a 
7 7 b 

[[2]] 
    x y 
1 4 d
2 5 e
3 6 a
4 7 b
5 8 c
6 9 d
7 10 e

>OverLapSplit(DF,nsplits=5,overlap=1) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 

[[2]] 
    x y 
1 3 c 
2 4 d 
3 5 e 

[[3]] 
    x y 
1 5 e 
2 6 a 
3 7 b 

[[4]] 
    x y 
1 7 b 
2 8 c 
3 9 d 

[[5]] 
    x y 
1 8 c
2 9 d
3 10 e

I haven't given much thought to what should happen if you call something like OverLapSplit(DF,nsplits=2,overlap=1).

Maybe the following:

[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 

[[2]] 
    x y 
1 5 e
2 6 a
3 7 b
4 8 c
5 9 d
6 10 e

Thanks!

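So does this function not exist, or do you just not know how to handle the edge cases? – Chase 2011-04-13 18:36:07

@Chase: The function does not exist. If I get a workable (if inelegant) version coded, I will post it. – Zach 2011-04-13 18:49:21

@Zach Is your question apropos of this one? http://stackoverflow.com/q/5652058/429846 – 2011-04-13 19:33:45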

Answers

6

Try something like this:

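OverlapSplit <- function(x, nsplit = 1, overlap = 2) {
    nrows <- NROW(x)
    # rows per piece: nsplit + 1 pieces, adjacent pieces sharing `overlap` rows
    nperdf <- ceiling((nrows + overlap * nsplit) / (nsplit + 1))
    # first row of each piece
    start <- seq(1, nsplit * (nperdf - overlap) + 1, by = nperdf - overlap)

    # the last piece ends at start + nperdf - 1; warn if that runs past the data
    if (start[nsplit + 1] + nperdf - 1 != nrows)
        warning("Returning an incomplete dataframe.")

    lapply(start, function(i) x[c(i:(i + nperdf - 1)), ])
}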

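Here nsplit is the number of splits (cut points), so nsplit = 1 returns 2 data frames. If the overlapping pieces do not fit the data frame exactly, the last data frame is returned incomplete (padded with NA rows) and a warning is issued.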
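To make the arithmetic concrete, here is a small sketch that computes only the row ranges the function above would use. The helper name `split_ranges` is made up for illustration and is not part of the answer; it just repeats the same formulas.

```r
# Hypothetical helper: first and last row index of each piece,
# using the same formulas as OverlapSplit above.
split_ranges <- function(nrows, nsplit = 1, overlap = 2) {
  # nsplit + 1 pieces must cover nrows rows,
  # with each of the nsplit joins shared by `overlap` rows
  nperdf <- ceiling((nrows + overlap * nsplit) / (nsplit + 1))
  step   <- nperdf - overlap          # distance between consecutive starts
  start  <- seq(1, nsplit * step + 1, by = step)
  data.frame(start = start, end = start + nperdf - 1)
}

ranges <- split_ranges(10, nsplit = 3, overlap = 2)
print(ranges)
```

For the 10-row DF with nsplit = 3 and overlap = 2, the ranges are 1-4, 3-6, 5-8 and 7-10, so the pieces fit exactly. With nsplit = 1 and overlap = 1 the second range is 6-11, one row past the data, which is the case that produces the NA row.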

> OverlapSplit(DF,nsplit=3,overlap=2) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 

[[2]] 
    x y 
3 3 c 
4 4 d 
5 5 e 
6 6 a 

[[3]] 
    x y 
5 5 e 
6 6 a 
7 7 b 
8 8 c 

[[4]] 
    x y 
7 7 b 
8 8 c 
9 9 d 
10 10 e 

And one that gives a warning:

> OverlapSplit(DF,nsplit=1,overlap=1) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 
6 6 a 

[[2]] 
    x y 
6 6 a 
7 7 b 
8 8 c 
9 9 d 
10 10 e 
NA NA <NA> 

Warning message: 
In OverlapSplit(DF, nsplit = 1, overlap = 1) : 
    Returning an incomplete dataframe. 
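
+1 Nice answer from first principles; I'm too [lazy|stupid]* for first principles. [*delete as applicable] ;-) – 2011-04-13 20:10:22

@Gavin Simpson: I posted my own answer with the complete workflow I had in mind. There is definitely room for improvement, but I think it meets my needs for now. Thanks for all the suggestions! – Zach 2011-04-13 21:13:25

@Joris Meys How would you go about fixing the problem of not including the "incomplete" overlapping data frame (i.e., so that it would only throw a warning)? – 2015-12-02 20:20:21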
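
One possible way to avoid the incomplete last data frame, offered here as an assumption and not as Joris Meys's reply, is to clip each piece at the last available row so that the final piece is simply shorter. The name `OverlapSplitClipped` is hypothetical.

```r
# Hypothetical variant: same start positions as OverlapSplit,
# but each piece stops at the last row instead of running past it.
OverlapSplitClipped <- function(x, nsplit = 1, overlap = 2) {
  nrows  <- NROW(x)
  nperdf <- ceiling((nrows + overlap * nsplit) / (nsplit + 1))
  start  <- seq(1, nsplit * (nperdf - overlap) + 1, by = nperdf - overlap)
  start  <- start[start <= nrows]        # drop pieces that would start past the data
  lapply(start, function(i) {
    last <- min(i + nperdf - 1, nrows)   # never index past the data
    x[i:last, , drop = FALSE]
  })
}

DF <- data.frame(x = seq(1, 10), y = rep(c("a", "b", "c", "d", "e"), 2))
pieces <- OverlapSplitClipped(DF, nsplit = 1, overlap = 1)
sapply(pieces, nrow)
```

For the nsplit = 1, overlap = 1 case above, this returns rows 1-6 and rows 6-10, with no NA row. The trade-off is that the pieces no longer all have the same length.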

4

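This uses the shingle idea from Lattice graphics, so it leverages code from the lattice package to generate the intervals and then uses a loop to break the original DF into the correct subsets.

I'm not entirely sure what is meant by overlap = 1. I presume you mean an overlap of 1 sample/observation. If so, the code below does this.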

OverlapSplit <- function(x, nsplits = 1, overlap = 0) {
    stopifnot(require(lattice))
    N <- seq_len(nr <- nrow(x))
    # co.intervals() takes the overlap as a fraction, hence overlap/nr
    interv <- co.intervals(N, nsplits, overlap/nr)
    out <- vector(mode = "list", length = nrow(interv))
    for(i in seq_along(out)) {
        # keep the rows whose index lies inside the i-th interval
        out[[i]] <- x[interv[i,1] < N & N < interv[i,2], , drop = FALSE]
    }
    out
}

Which gives:

> OverlapSplit(DF, 2, 2) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 
6 6 a 

[[2]] 
    x y 
5 5 e 
6 6 a 
7 7 b 
8 8 c 
9 9 d 
10 10 e 

> OverlapSplit(DF) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 
4 4 d 
5 5 e 
6 6 a 
7 7 b 
8 8 c 
9 9 d 
10 10 e 

> OverlapSplit(DF, 4, 1) 
[[1]] 
    x y 
1 1 a 
2 2 b 
3 3 c 

[[2]] 
    x y 
3 3 c 
4 4 d 
5 5 e 

[[3]] 
    x y 
6 6 a 
7 7 b 
8 8 c 

[[4]] 
    x y 
8 8 c 
9 9 d 
10 10 e 
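
Just note the definition of 'overlap'; 'co.intervals()' wants the overlap as a fraction rather than the absolute number of overlapping samples, so rounding issues may arise in some cases. If that happens, you will get less/more overlap than you wanted. – 2011-04-13 20:04:50

+1 woo-yeah! Never thought a hack would do this for me. Nice one. – 2011-04-13 20:22:20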
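
Given the rounding caveat in the comment above, one way to see how much overlap you actually got is to count the row names shared by consecutive pieces. This is a minimal sketch; the helper name `overlap_counts` is hypothetical, and the pieces are rebuilt by hand from the OverlapSplit(DF, 4, 1) output shown in the answer, so lattice is not needed to run it.

```r
# Hypothetical helper: number of rows shared by each pair of consecutive pieces,
# judged by row names (which R keeps when subsetting a data frame).
overlap_counts <- function(pieces) {
  if (length(pieces) < 2) return(integer(0))
  vapply(seq_len(length(pieces) - 1), function(i) {
    length(intersect(rownames(pieces[[i]]), rownames(pieces[[i + 1]])))
  }, integer(1))
}

DF <- data.frame(x = seq(1, 10), y = rep(c("a", "b", "c", "d", "e"), 2))
# The four pieces shown for OverlapSplit(DF, 4, 1), rebuilt by hand
pieces <- list(DF[1:3, ], DF[3:5, ], DF[6:8, ], DF[8:10, ])
overlap_counts(pieces)
```

For those pieces the counts are 1, 0 and 1: the middle pair shares no rows, so the requested overlap of 1 was not achieved everywhere.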

0

Just to make clear what I'm doing here:

#Load Libraries
library(PerformanceAnalytics)
library(quantmod)

#Function to Split Data Frame
OverlapSplit <- function(x, nsplit = 1, overlap = 0) {
    nrows <- NROW(x)
    nperdf <- ceiling((nrows + overlap * nsplit) / (nsplit + 1))
    start <- seq(1, nsplit * (nperdf - overlap) + 1, by = nperdf - overlap)

    #The last piece ends at start + nperdf - 1
    if (start[nsplit + 1] + nperdf - 1 != nrows)
        warning("Returning an incomplete dataframe.")

    lapply(start, function(i) x[c(i:(i + nperdf - 1)), ])
}

#Function to run regression on 30 days to predict the next day
FL <- as.formula(Next(HAM1) ~ HAM1 + HAM2 + HAM3 + HAM4)
MyRegression <- function(df, FL) {
    df <- as.data.frame(df)
    model <- lm(FL, data = df[1:30, ])
    predict(model, newdata = df[31, ])
}

#Function to roll the regression
#(no padding: only complete 31-row windows produce a value)
RollMyRegression <- function(data, ModelFUN, FL) {
    rollapply(data, width = 31, FUN = ModelFUN, FL,
              by.column = FALSE, align = "right")
}

#Load Data
data(managers)

#Split Dataset
#(overlap of 30 = window width - 1, so no window is lost at the joins)
split.data <- OverlapSplit(managers, 2, 30)
sapply(split.data, dim)

#Run rolling regression on each split
output <- lapply(split.data, RollMyRegression, MyRegression, FL)
output
unlist(output)

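This way, you can replace the lapply at the end with a parallel version of lapply and improve your speed.

Of course, there is now the question of optimizing the split/overlap, given the number of processors and the size of the data set.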
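
As a rough sketch of that parallel step, the base `parallel` package offers `parLapply`, which has the same shape as `lapply`. The worker function `piece_mean` and the hand-built pieces below are placeholders standing in for `RollMyRegression` and `split.data`, and the choice of two workers is arbitrary.

```r
library(parallel)

# Placeholder for the per-piece work (stands in for RollMyRegression)
piece_mean <- function(df) mean(df$x)

DF <- data.frame(x = seq(1, 10), y = rep(c("a", "b", "c", "d", "e"), 2))
pieces <- list(DF[1:6, ], DF[5:10, ])   # two overlapping pieces

cl <- makeCluster(2)                    # two worker processes (arbitrary)
result <- parLapply(cl, pieces, piece_mean)
stopCluster(cl)

unlist(result)
```

For the real workflow, the workers would also need the required packages loaded (for example with `clusterEvalQ`), and the extra arguments would be passed through the `...` of `parLapply`. Whether this is faster depends on how expensive each piece is relative to the cost of starting the workers and copying the data to them.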