Merging the names of duplicate columns

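I have a data frame in which some columns contain the same data but have different column names. I want to remove the duplicate columns but merge their column names. For example, here the columns test1 and test4 are duplicates: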

df 

      test1 test2 test3 test4
    1     1     1     0     1
    2     2     2     2     2
    3     3     4     4     3
    4     4     4     4     4
    5     5     5     5     5
    6     6     6     6     6

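I would like the result to keep a single copy of that column, under a name that combines test1 and test4.

Here is the data: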

structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame")

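Please note that I do not just want to remove the duplicate columns. I also want to merge the names of the duplicated columns after removing the duplicates.

I could do this by hand for the simple table I posted, but I want to use it on large data sets where I do not know in advance which columns are identical. I do not want to delete and rename columns manually, because I may have more than 50 duplicate columns.

Source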

2017-03-27 arielle

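We have to assume you Googled "R remove duplicate columns". Please explain why the first few hits did not help. Otherwise, this question will be closed as a duplicate. – Henrik

Yes, I did. Please look at the column names in the result table. I do not just want to remove the duplicate columns; I also want to merge the names of the duplicated columns after removing the duplicates. I could do this by hand for the simple table I posted, but I want to use it on large data sets. – arielle

Do you know in advance which columns are duplicates? Or do you want that determined automatically? – MichaelChirico

OK, this improves on the earlier answer using the idea from here. Save the non-duplicated and the duplicated columns into two separate data frames. Then check whether each non-duplicated column matches any of the duplicates, and if it does, concatenate their column names. This now works when more than two columns hold the same data.

Edit: changed summary to digest. This helps with character data.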
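As a small illustration of why summary makes a weak fingerprint for character columns, here is a sketch in base R. It assumes the earlier version compared the output of summary() per column, which the edit note suggests but does not show; the vectors a and b are made up for the example.

```r
# Two character vectors with the same length but different contents
a <- c("M", "F", "M")
b <- c("F", "F", "M")

# summary() of a character vector only reports Length, Class and Mode,
# so it cannot tell these two vectors apart
same_summary <- identical(summary(a), summary(b))

# Comparing the actual values does tell them apart
same_data <- identical(a, b)

same_summary
same_data
```

A hash of the full column contents, such as the one digest computes, does not have this blind spot.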

df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame") 

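library(digest)

# Hash every column; columns with equal hashes hold the same data
is_dup <- duplicated(lapply(df, digest))
nondups <- df[!is_dup]  # first occurrence of each distinct column
dups <- df[is_dup]      # later copies of a column already seen

# Append each duplicate's name to the name of the column it copies.
# seq_along() keeps the loops from running when there are no duplicates.
for (i in seq_along(nondups)) {
    for (j in seq_along(dups)) {
        if (identical(nondups[[i]], dups[[j]])) {
            names(nondups)[i] <- paste(names(nondups)[i], names(dups)[j], sep = "+")
        }
    }
}

nondups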
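
The nested loop can also be avoided by grouping the columns directly. The following is a minimal base R sketch, not part of the original answer: it assumes identical() is an acceptable test for "same data" in place of digest, and the names first_match, name_groups and merged_names are made up for illustration.

```r
# Same example data as in the question
df <- data.frame(
  test1 = c(1, 2, 3, 4, 5, 6),
  test2 = c(1, 2, 4, 4, 5, 6),
  test3 = c(0, 2, 4, 4, 5, 6),
  test4 = c(1, 2, 3, 4, 5, 6)
)

# For every column, find the position of the first column with identical data
first_match <- vapply(
  seq_along(df),
  function(i) which(vapply(df, identical, logical(1), df[[i]]))[[1]],
  integer(1)
)

# Collect the names of all columns that share the same first occurrence
name_groups <- split(names(df), first_match)
merged_names <- vapply(name_groups, paste, character(1), collapse = "+")

# Keep one copy per group and apply the merged names
result <- df[unique(first_match)]
names(result) <- unname(merged_names)
result
```

Because every column is assigned to exactly one group, three or more identical columns end up under a single merged name rather than as a set of pairs.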

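Example 2, as a function.

Edit: changed summary to digest, and the function now returns both the non-duplicated and the duplicated data frames.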

age <- 18:29 
height <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5) 
gender <- c("M","F","M","M","F","F","M","M","F","M","F","M") 
testframe <- data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender, gender3 = gender) 

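dupcols <- function(df = testframe){
    # Hash every column; equal hashes mean equal data (needs library(digest))
    is_dup <- duplicated(lapply(df, digest))
    nondups <- df[!is_dup]
    dups <- df[is_dup]

    # Append each duplicate's name to the name of the column it copies
    for (i in seq_along(nondups)) {
        for (j in seq_along(dups)) {
            if (identical(nondups[[i]], dups[[j]])) {
                names(nondups)[i] <- paste(names(nondups)[i], names(dups)[j], sep = "+")
            }
        }
    }

    # df1: unique columns with merged names; df2: the removed duplicates
    return(list(df1 = nondups, df2 = dups))
}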

dupcols(df = testframe)

Edited: this part is new.

Example 3: on a large data frame

#Creating a 1500 column by 15000 row data frame 
dat <- do.call(data.frame, replicate(1500, rep(FALSE, 15000), simplify=FALSE)) 
names(dat) <- 1:1500 

#Fill the data frame with LETTERS across the rows 
#This part may take a while. Took my PC about 23 minutes. 
start <- Sys.time()
fill <- rep(LETTERS, times = ceiling((15000*1500)/26))
j <- 0
for(i in 1:nrow(dat)){
    dat[i,] <- fill[(1+j):(1500+j)]
    j <- j + 1500
}
# units must be named: the third positional argument of difftime() is tz
difftime(Sys.time(), start, units = "mins")

#Run the function on the created data set 
#This took about 4 minutes to complete on my PC. 
start <- Sys.time()
result <- dupcols(df = dat)
difftime(Sys.time(), start, units = "mins")

names(result$df1) 
ncol(result$df1) 
ncol(result$df2)
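
If the 23-minute fill loop is a problem, a similar test frame can probably be built much faster by filling a matrix row by row. This is only a sketch under the assumption that the goal is LETTERS cycling across the rows; make_letter_frame is a hypothetical helper, not something from the answer.

```r
# Hypothetical helper: build an n_row by n_col data frame whose cells
# cycle through LETTERS across each row, like the loop-based fill
make_letter_frame <- function(n_row, n_col) {
  fill <- rep(LETTERS, length.out = n_row * n_col)
  # byrow = TRUE lays the letters out along each row before moving on
  mat <- matrix(fill, nrow = n_row, ncol = n_col, byrow = TRUE)
  out <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(out) <- seq_len(n_col)
  out
}

# Small demonstration; make_letter_frame(15000, 1500) matches the sizes above
small <- make_letter_frame(4, 30)
small[1:2, 1:5]
```

With this layout, columns that are 26 positions apart hold the same letters, so the frame contains plenty of duplicate columns to test against.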

Source

2017-03-27 18:23:48 Jake

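It seems to work beautifully, thank you very much! – arielle

I am guessing this could take a while to run on very large data frames, for example 15000 by 1500? – arielle

Test it. Using the example I provided and replicating the data frame many times, it still runs very fast. 'dfnew <- do.call("data.frame", replicate(500, testframe, simplify = FALSE)); ncol(dfnew); start <- Sys.time(); result <- dupcols(df = dfnew); difftime(Sys.time(), start, units = "secs")' The column names do get rather unwieldy. – Jake

It is not fully automated, but the output of the loop identifies the pairs of duplicate columns. You then have to delete one column of each pair and rename the remaining one after both.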

df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame") 

# Compare each pair of columns once and print the positions of identical pairs
for(i in seq_len(ncol(df) - 1)){
    for(j in (i + 1):ncol(df)){
        if(identical(df[[i]], df[[j]])) print(paste(i, j, sep = " + "))
    }
}

new <- df[,-4] 
names(new)[1] <- paste(names(df[1]), names(df[4]), sep = "+") 
new

Source

2017-03-27 17:55:34 Jake

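This looks like a good start, but it does not work if more than two columns hold the same data, because it looks for all possible pairs... – arielle

And I am really looking for a way that avoids deleting and renaming columns by hand, since I may have more than 50 duplicate columns – arielle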
