在data.frame中有效定位組合常數列

如何從數據框中有效地提取組合常數列？我已經在下面包含了一個plyr實現來明確我想要做什麼，但是它很慢。我怎樣才能儘可能有效地做到這一點？（理想情況下，根本不分割數據幀）。在data.frame中有效定位組合常數列

base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000)) 
df <- data.frame(
    base[rep(seq_len(nrow(base)), length = 1e6), ], 
    c = runif(1e6), 
    d = runif(1e6) 
) 


is.constant <- function(x) length(unique(x)) == 1 
constant_cols <- function(x) head(Filter(is.constant, x), 1) 
system.time(constant <- ddply(df, "group", constant_cols)) 
# user system elapsed 
# 20.531 1.670 22.378 
stopifnot(identical(names(constant), c("group", "a", "b"))) 
stopifnot(nrow(constant) == 1000)

在我的實際使用情況下（在內心深處GGPLOT2）有可能是恆定的，非恆列的任意數量。示例中數據的大小約爲正確的數量級。

來源

2011-12-27 hadley

你已經這樣做比使用plyr任何純-R實現更好的。恕我直言，你只能通過按組排序df（相當快）然後掃描C代碼中斷來做更好的事情。 – 2011-12-27 19:31:00

@Simon我比任何基於plyr的行解決方案都做得更好 - 我覺得應該有一個基於列的解決方案。 – hadley 2011-12-28 15:04:25

通過@ Joran的回答啓發，這裏有類似的策略，這是一個快一點（1 s和1.5秒，我的機器上）

changed <- function(x) c(TRUE, x[-1] != x[-n]) 

constant_cols2 <- function(df,grp){ 
    df <- df[order(df[,grp]),] 
    n <- nrow(df) 
    changes <- lapply(df, changed) 

    vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1)) 
} 
system.time(cols <- constant_cols2(df, "group")) # about 1 s 

system.time(constant <- df[changed(df$group), cols]) 
# user system elapsed 
# 1.057 0.230 1.314 

stopifnot(identical(names(constant), c("group", "a", "b"))) 
stopifnot(nrow(constant) == 1000)

它具有相同的缺陷，雖然，因爲它不會檢測列那正在對相鄰組

相同的值（例如，df$f <- 1）隨着多一點思考加上@大衛的想法：

constant_cols3 <- function(df, grp) { 
    # If col == TRUE and group == FALSE, not constant 
    matching_breaks <- function(group, col) { 
    !any(col & !group) 
    } 

    n <- nrow(df) 
    changed <- function(x) c(TRUE, x[-1] != x[-n]) 

    df <- df[order(df[,grp]),] 
    changes <- lapply(df, changed) 
    vapply(changes[-1], matching_breaks, group = changes[[1]], 
    FUN.VALUE = logical(1)) 
} 

system.time(x <- constant_cols3(df, "group")) 
# user system elapsed 
# 1.086 0.221 1.413

這給出正確的結果。

來源

2011-12-27 17:53:09 hadley

我剛想到，您可能能夠通過添加「0：1」向量來解決相鄰組問題中的相同值每一列與'group'一起重複，然後再做'rle'。 – joran 2011-12-27 20:27:25

嗯，似乎更快，如果不是排序數據框，我排序單個列，因爲我計算的變化。 – hadley 2011-12-28 15:27:57

（編輯：更好的答案）

什麼像

is.constant<-function(x) length(which(x==x[1])) == length(x)

這似乎是一個不錯的改進。比較以下內容。

> a<-rnorm(5000000) 

> system.time(is.constant(a)) 
    user system elapsed 
    0.039 0.010 0.048 
> 
> system.time(is.constantOld(a)) 
    user system elapsed 
    1.049 0.084 1.125

來源

2011-12-27 17:08:49 jebyrnes

啊，但在舊代碼中插入，is.constant似乎不是瓶頸。 Hrm ...但是，每一點都有幫助，呃？ – jebyrnes 2011-12-27 17:26:20

我會認爲'is.constant < - function（x）！any（x [1]！= x）'會更好。但你說得對，這不是瓶頸 - 這是緩慢的數據幀的分裂和組合。 – hadley 2011-12-27 17:29:38

（編輯，以應對可能具有相同值的連續組的問題）

我試探性地提交這個答案，但我還沒有完全相信自己，它會正確識別內在所有情況下都是組常量列。但它肯定更快（大概可以提高）：

constant_cols1 <- function(df,grp){ 
    df <- df[order(df[,grp]),] 

    #Adjust values based on max diff in data 
    rle_group <- rle(df[,grp]) 
    vec <- rep(rep(c(0,ceiling(diff(range(df)))), 
       length.out = length(rle_group$lengths)), 
       times = rle_group$lengths) 
    m <- matrix(vec,nrow = length(vec),ncol = ncol(df)-1) 
    df_new <- df 
    df_new[,-1] <- df[,-1] + m 

    rles <- lapply(df_new,FUN = rle) 
    nms <- names(rles) 
    tmp <- sapply(rles[nms != grp], 
        FUN = function(x){identical(x$lengths,rles[[grp]]$lengths)}) 
    return(tmp) 
}

我的基本想法是使用rle，效果顯着。

來源

2011-12-27 17:28:42 joran

嗯，我認爲如果這個數值在多個組中是相同的（例如長度是2000），那麼它就不起作用。真的很有趣的做法，雖然 – hadley 2011-12-27 17:44:56

@hadley Drat，你說得對。 – joran 2011-12-27 17:59:42

我認爲應該更容易在我的方法中修復，其工作方式與您的方法類似，但使用邏輯向量 – hadley 2011-12-27 18:14:10

我不確定這是否正是您正在尋找的內容，但它確定了列a和b。

require(data.table) 
is.constant <- function(x) identical(var(x), 0) 
dtOne <- data.table(df) 
system.time({dtTwo <- dtOne[, lapply(.SD, is.constant), by=group] 
result <- apply(X=dtTwo[, list(a, b, c, d)], 2, all) 
result <- result[result == TRUE] }) 
stopifnot(identical(names(result), c("a", "b"))) 
result

來源

2011-12-27 21:12:54 Jared

不幸的是，我試圖用盡可能少的外部依賴來做到這點，但是這確實給了我一個時間來瞄準：在我的電腦上0.5秒。 – hadley 2011-12-27 21:43:10

我試着用aggregate和by做同樣的事情，他們大約10和18秒，而不是0.3秒data.table花了。 – Jared 2011-12-28 03:53:26

是的，因爲一個很大的瓶頸是數據框的子集 - 它很慢，因爲它創建了一個副本。數據表不這樣做，所以它很快。 – hadley 2011-12-28 15:01:53

is.unsorted(x)對非常數x失效有多快？可悲的是，我目前無法訪問R。也似乎不是你的瓶頸，但。

來源

2011-12-28 00:17:02 dholstius

一點比哈德利上面所建議的速度較慢，但我認爲它應該處理等於相鄰的基團

findBreaks <- function(x) cumsum(rle(x)$lengths) 

constantGroups <- function(d, groupColIndex=1) { 
    d <- d[order(d[, groupColIndex]), ] 
    breaks <- lapply(d, findBreaks) 
    groupBreaks <- breaks[[groupColIndex]] 
    numBreaks <- length(groupBreaks) 
    isSubset <- function(x) length(x) <= numBreaks && length(setdiff(x, groupBreaks)) == 0 
    unlist(lapply(breaks[-groupColIndex], isSubset)) 
}

的直覺是的情況下，如果一個列是恆定的GroupWise然後列中的值的中斷（按組值排序）將是組值中斷的子集。

現在，隨着哈德利的它比較（與小的修改，以確保n定義）

# df defined as in the question 

n <- nrow(df) 
changed <- function(x) c(TRUE, x[-1] != x[-n]) 

constant_cols2 <- function(df,grp){ 
    df <- df[order(df[,grp]),] 
    changes <- lapply(df, changed) 
    vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1)) 
} 

> system.time(constant_cols2(df, 1)) 
    user system elapsed 
    1.779 0.075 1.869 
> system.time(constantGroups(df)) 
    user system elapsed 
    2.503 0.126 2.614 
> df$f <- 1 
> constant_cols2(df, 1) 
    a  b  c  d  f 
TRUE TRUE FALSE FALSE FALSE 
> constantGroups(df) 
    a  b  c  d  f 
TRUE TRUE FALSE FALSE TRUE

來源

2011-12-28 08:39:37

不錯！我認爲可以使我的版本適應與您的版本相同的策略，所以它可以保持更快一點。 – hadley 2011-12-28 15:03:32

只是通過回答使用與您的想法相同的思路，但與邏輯向量。謝謝！ – hadley 2011-12-28 15:18:27

在data.frame中有效定位組合常數列

回答

相關問題