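Finding perfectly correlated/redundant numeric and character columns

I have a dataset with a few hundred columns. It contains mailing list data, and several of the columns appear to be exact duplicates of each other, just in different forms.

For example: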

rowNum StateCode  StateName  StateAbbreviation 
    1   01    UTAH    UT 
    2   01    UTAH    UT 
    3   03    TEXAS    TX 
    4   03    TEXAS    TX 
    5   03    TEXAS    TX 
    6   44    OHIO    OH 
    7   44    OHIO    OH 
    8   44    OHIO    OH 
...   ...   ...    ...

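I'd like to get rid of the overlapping data so that only one column carries each piece of information, keeping the numeric column where possible. So the example above would become: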

rowNum StateCode 
     1   01 
     2   01 
     3   03 
     4   03 
     5   03 
     6   44 
     7   44 
     8   44 
    ...   ...

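I've tried cor(), but that only works for numeric variables. I've also tried caret::nearZeroVar(), but that only looks at each column on its own.

Does anyone have any suggestions for finding perfectly correlated columns when non-numeric data is involved?

Thanks.

Source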

2012-09-04 screechOwl

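Just edited my answer to simplify its approach. It now uses `cor()`, which I certainly should have picked up on from your question to begin with. Thanks for the cool question.

@JoshO'Brien: Works great. Thanks very much. – screechOwl

Here's a fun and quick solution. It first converts the data.frame to an integer matrix with a convenient structure (each column is recoded by order of first appearance), and then uses cor() to identify the redundant columns.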

## Read in the data 
df <- read.table(text="rowNum StateCode  StateName  StateAbbreviation 
    1   01    UTAH    UT 
    2   01    UTAH    UT 
    3   03    TEXAS    TX 
    4   03    TEXAS    TX 
    5   03    TEXAS    TX 
    6   44    OHIO    OH 
    7   44    OHIO    OH 
    8   44    OHIO    OH", header=TRUE) 

## Convert data.frame to a matrix with a convenient structure 
## (have a look at m to see where this is headed) 
l <- lapply(df, function(X) as.numeric(factor(X, levels=unique(X)))) 
m <- as.matrix(data.frame(l)) 

## Identify pairs of perfectly correlated columns  
M <- (cor(m,m)==1) 
M[lower.tri(M, diag=TRUE)] <- FALSE 

## Extract the names of the redundant columns 
colnames(M)[colSums(M)>0] 
## [1] "StateName"   "StateAbbreviation"
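For what it's worth, the same recoding idea can be wrapped in a small helper that returns the de-duplicated data frame directly. This is only a sketch: `drop_redundant_cols` is a hypothetical name, not part of base R or any package, and it swaps the `cor() == 1` test for a direct comparison of the first-appearance codes via `duplicated()`. That sidesteps exact floating-point equality and the `NA` correlations that constant columns produce.

```r
## Hypothetical helper (not from the answer above): keep the first column
## of every group of columns that are one-to-one recodings of each other.
drop_redundant_cols <- function(df) {
  ## Recode each column by order of first appearance, e.g.
  ## c("UTAH", "UTAH", "TEXAS") becomes c(1L, 1L, 2L)
  codes <- lapply(df, function(x) match(x, unique(x)))
  ## Columns whose codes are identical carry the same information
  df[, !duplicated(codes), drop = FALSE]
}

df <- read.table(text = "rowNum StateCode StateName StateAbbreviation
1 01 UTAH UT
2 01 UTAH UT
3 03 TEXAS TX
4 03 TEXAS TX
5 03 TEXAS TX
6 44 OHIO OH
7 44 OHIO OH
8 44 OHIO OH", header = TRUE)

reduced <- drop_redundant_cols(df)
names(reduced)
```

Because the first column of each group is the one that survives, putting the numeric column before its text equivalents (as in the example data) should leave just rowNum and StateCode.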

Source

2012-09-04 22:51:49

dat <- read.table(text="rowNum StateCode  StateName  
    1   01    UTAH 
    2   01    UTAH 
    3   03    TEXAS 
    4   03    TEXAS 
    5   03    TEXAS 
    6   44    OHIO 
    7   44    OHIO 
    8   44    OHIO", header=TRUE) 

dat[!duplicated(dat[, 2:3]), ] 
#------------ 
#   rowNum StateCode StateName 
# 1      1         1      UTAH 
# 3      3         3     TEXAS 
# 6      6        44      OHIO

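Source

2012-09-04 22:25:52

The question is asking about duplicate columns, not rows. – Marius

@Marius: If you're the source of the -1 vote, let me ask whether you think it's reasonable to downvote an answer when the OP has changed the question into something other than what it started as. When I posted this answer there was no "StateAbbreviation" column, and the question had no example of the "correct answer". I'm not worried about my score, but I think downvoting after the question has changed is poor citizenship.

I agree - a downvote serves no purpose here; it just makes this a less friendly place.

Would this do the trick? I'm basing it on the idea that if you call table(col1, col2), every column of the resulting table will have exactly one non-zero value when the two columns are duplicates, e.g.: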

     OHIO TEXAS UTAH 
  1     0     0    2 
  3     0     3    0 
  44    3     0    0

So, something like this:

dup.cols <- read.table(text='rowNum StateCode  StateName  StateAbbreviation 
    1   01    UTAH    UT 
    2   01    UTAH    UT 
    3   03    TEXAS    TX 
    4   03    TEXAS    TX 
    5   03    TEXAS    TX 
    6   44    OHIO    OH 
    7   44    OHIO    OH 
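    8   44    OHIO    OH', header=TRUE) 
library(plyr) 
## All pairs of column indices, one pair per column of combs 
combs <- combn(ncol(dup.cols), 2) 
adply(combs, 2, function(x) { 
    ## Cross-tabulate the two columns (named tab so it does not mask t()) 
    tab <- table(dup.cols[, x[1]], dup.cols[, x[2]]) 
    ## Duplicates: every column of the table has exactly one non-zero cell 
    if (all(aaply(tab, 2, function(col) {sum(col != 0) == 1}))) { 
        paste("Column numbers", x[1], x[2], "are duplicates") 
    } 
})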
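One caveat with the cross-tabulation idea: looking only at the columns of the table tests the mapping in one direction, so a pair such as (StateCode, rowNum), in that order, would pass even though many row numbers share a state code. A minimal base-R sketch that checks both rows and columns might look like the following. `is_one_to_one` is a hypothetical helper name, and the sketch assumes there are no `NA`s or unused factor levels, since `table()` drops the former and adds empty rows or columns for the latter.

```r
## Hypothetical helper: TRUE when every value of x pairs with exactly one
## value of y and vice versa.
is_one_to_one <- function(x, y) {
  tab <- table(x, y)
  ## exactly one non-zero cell in every column AND in every row
  all(colSums(tab != 0) == 1) && all(rowSums(tab != 0) == 1)
}

dup.cols <- read.table(text = "rowNum StateCode StateName StateAbbreviation
1 01 UTAH UT
2 01 UTAH UT
3 03 TEXAS TX
4 03 TEXAS TX
5 03 TEXAS TX
6 44 OHIO OH
7 44 OHIO OH
8 44 OHIO OH", header = TRUE)

## All unordered pairs of column names, then keep the one-to-one pairs
pairs <- combn(names(dup.cols), 2, simplify = FALSE)
dups <- Filter(function(p) is_one_to_one(dup.cols[[p[1]]], dup.cols[[p[2]]]),
               pairs)
dups
```

Here `dups` is a list of character vectors of length two, one per matching pair, which avoids the plyr dependency for this step.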

Source

2012-09-04 22:51:52 Marius

This should give you back a map telling you which variables match each other.

check.dup <- expand.grid(names(dat), names(dat), stringsAsFactors=FALSE) # find all variable pairs 
check.dup <- check.dup[check.dup$Var1 != check.dup$Var2, ] # take out self-reference (result must be assigned) 
check.dup$id <- mapply(function(x, y) { 
    # if the number of distinct values is different, discard; 
    # keep the number n for later 
    if ((n <- length(unique(dat[, x]))) != length(unique(dat[, y]))) { 
        return(FALSE) 
    } 
    # subset just the two variables in question and find the unique pairs 
    d <- unique(dat[, c(x, y)]) 
    # if the number of unique pairs equals the number of distinct values 
    # from before, then the pairing is one-to-one 
    nrow(d) == n 
}, check.dup$Var1, check.dup$Var2)
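If a matrix is easier to scan than a long table of pairs, the same counting argument can be laid out as a logical "map" indexed by column name. This is a sketch under assumptions: `is_recoding` and `match_map` are hypothetical names, and the example data is the four-column table from the question rather than a real mailing list.

```r
## Hypothetical helper: x and y match when they have the same number of
## distinct values and that number equals the number of distinct (x, y) pairs.
is_recoding <- function(x, y) {
  n_x <- length(unique(x))
  n_y <- length(unique(y))
  n_pairs <- nrow(unique(data.frame(x, y)))
  n_x == n_y && n_pairs == n_x
}

dat <- read.table(text = "rowNum StateCode StateName StateAbbreviation
1 01 UTAH UT
2 01 UTAH UT
3 03 TEXAS TX
4 03 TEXAS TX
5 03 TEXAS TX
6 44 OHIO OH
7 44 OHIO OH
8 44 OHIO OH", header = TRUE)

## Logical matrix: match_map[a, b] is TRUE when columns a and b match
vars <- names(dat)
match_map <- sapply(vars, function(a) {
  sapply(vars, function(b) is_recoding(dat[[a]], dat[[b]]))
})
match_map
```

The diagonal is trivially TRUE, so only the off-diagonal entries are of interest. With a few hundred columns this performs on the order of the square of the column count in comparisons, which may or may not be acceptable depending on the number of rows.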

Source

2012-09-04 22:52:56
