R: Remove rows based on a similarity check of two columns

row.no column2 column3 column4 
1  bb   ee  up 
2  bb   ee  down 
3  bb   ee  up 
4  bb   yy  down 
5  bb   zz  up

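I have a rule to remove rows 1, 2 and 3: columns 2 and 3 are identical for rows 1, 2 and 3, yet contradictory data (up and down) are found in column 4.

How can I ask R to remove those rows that have the same values in column 2 and column 3 but contradictory values in column 4, to produce a matrix as shown below: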

row.no column2 column3 column4 
4  bb   yy  down 
5  bb   zz  up

2011-04-17 Catherine

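The functions in the plyr package really shine for this kind of problem. Here is a solution that uses two lines of code.

Set up the data (kindly provided by @GavinSimpson)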

dat <- structure(list(row.no = 1:5, column2 = structure(c(1L, 1L, 1L, 
1L, 1L), .Label = "bb", class = "factor"), column3 = structure(c(1L, 
1L, 1L, 2L, 3L), .Label = c("ee", "yy", "zz"), class = "factor"), 
    column4 = structure(c(2L, 1L, 2L, 1L, 2L), .Label = c("down", 
    "up"), class = "factor")), .Names = c("row.no", "column2", 
"column3", "column4"), class = "data.frame", row.names = c(NA, 
-5L))

Load the plyr package

library(plyr)

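Use ddply to split, analyze, and combine dat. The following code splits the data into the unique combinations of (column2, column3). I then add a column named unique that counts the number of unique values of column4 within each set. Finally, a simple subset returns only those rows where unique == 1 and drops column 5.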

df <- ddply(dat, .(column2, column3), transform, 
    row.no=row.no, unique=length(unique(column4))) 
df[df$unique==1, -5]

And the result:

row.no column2 column3 column4 
4  4  bb  yy down 
5  5  bb  zz  up
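
For readers without plyr, the same count-the-distinct-values-per-group idea can be sketched in base R with ave(). This is only an illustration, not part of the answer above: the data frame is rebuilt with plain character columns, and the names n_distinct and result are hypothetical.

```r
# Rebuild the example data (same values as dat above, as character columns)
dat <- data.frame(
  row.no  = 1:5,
  column2 = c("bb", "bb", "bb", "bb", "bb"),
  column3 = c("ee", "ee", "ee", "yy", "zz"),
  column4 = c("up", "down", "up", "down", "up")
)

# For each (column2, column3) group, count the distinct values of column4.
# column4 is converted to integer codes so that ave() returns integers.
n_distinct <- ave(
  as.integer(factor(dat$column4)),
  dat$column2, dat$column3,
  FUN = function(x) length(unique(x))
)

# Keep only the rows whose group has a single distinct column4 value
result <- dat[n_distinct == 1, ]
result
```

As in the ddply version, a group is kept when column4 takes only one value within it, so repeated but consistent rows would also survive.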

2011-04-17 08:30:27 Andrie

+1 for using plyr – 2011-04-17 09:37:02

You can try either of the following two methods. Assume the table is called 'table1'.

Method 1

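# Collect the index of every row whose (column 2, column 3) pair occurs more than once
repeated_rows <- c()
for (i in seq_len(nrow(table1) - 1)) {
  for (j in (i + 1):nrow(table1)) {
    if (all(table1[i, 2:3] == table1[j, 2:3])) {
      repeated_rows <- c(repeated_rows, i, j)
    }
  }
}
repeated_rows <- unique(repeated_rows)
# Negative indexing with an empty vector would drop every row, so guard against it
if (length(repeated_rows) > 0) table1[-repeated_rows, ] else table1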

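Method 2

# Flag rows whose (column 2, column 3) pair repeats an earlier row
duplicates <- duplicated(table1[, 2:3])
for (i in seq_along(duplicates)) {
  if (duplicates[i]) {
    # Also flag every row that matches this pair, including its first occurrence
    for (j in seq_len(nrow(table1))) {
      if (all(table1[i, 2:3] == table1[j, 2:3])) {
        duplicates[j] <- TRUE
      }
    }
  }
}
table1[!duplicates, ]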
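
Both methods flag every row whose (column 2, column 3) pair occurs more than once. The same result can be sketched without loops by calling duplicated() twice, once from each end. This assumes table1 is a data frame holding the example data; the name dup is hypothetical.

```r
# Example data, assumed to match the table in the question
table1 <- data.frame(
  row.no  = 1:5,
  column2 = c("bb", "bb", "bb", "bb", "bb"),
  column3 = c("ee", "ee", "ee", "yy", "zz"),
  column4 = c("up", "down", "up", "down", "up")
)

# duplicated() marks the second and later occurrences of a pair;
# fromLast = TRUE marks the first and earlier ones. Their union
# marks every member of a repeated pair.
dup <- duplicated(table1[, 2:3]) | duplicated(table1[, 2:3], fromLast = TRUE)

table1[!dup, ]
```

Logical indexing also avoids the empty-index pitfall of negative indexing: if nothing is duplicated, every row is kept.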

2011-04-17 05:43:20

Here is one potential, if somewhat inelegant, solution

out <- with(dat, split(dat, interaction(column2, column3))) 
out <- lapply(out, function(x) if(NROW(x) > 1) {NULL} else {data.frame(x)}) 
out <- out[!sapply(out, is.null)] 
do.call(rbind, out)

Which gives:

> do.call(rbind, out) 
     row.no column2 column3 column4 
bb.yy  4  bb  yy down 
bb.zz  5  bb  zz  up

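Some explanation, line by line:

Line 1: splits the data into a list; each component is a data frame holding the rows of one group, where the groups are formed by the unique combinations of column2 and column3.
Line 2: iterates over the result of line 1; returns NULL if a data frame has more than 1 row, otherwise returns the 1-row data frame.
Line 3: iterates over the output of line 2 and returns only the non-NULL components.
Line 4: binds the output of line 3 together row-wise, which we arrange via do.call().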
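
As an aside, the split, filter and bind steps just described can also be sketched with Filter(), which keeps the single-row groups directly instead of passing through NULL. This is a variation for illustration only, not the answer's own code: the data frame is rebuilt with character columns, and the names groups, singletons and result are hypothetical.

```r
# Example data, assumed to match dat from the question
dat <- data.frame(
  row.no  = 1:5,
  column2 = c("bb", "bb", "bb", "bb", "bb"),
  column3 = c("ee", "ee", "ee", "yy", "zz"),
  column4 = c("up", "down", "up", "down", "up")
)

# Split into one data frame per (column2, column3) combination;
# drop = TRUE discards combinations that never occur in the data
groups <- split(dat, interaction(dat$column2, dat$column3, drop = TRUE))

# Keep only the groups that consist of exactly one row
singletons <- Filter(function(g) nrow(g) == 1, groups)

# Bind the surviving one-row data frames back together, row-wise
result <- do.call(rbind, singletons)
result
```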

The four-line solution can be simplified to two lines by combining lines 1-3 into a single line:

out <- lapply(with(dat, split(dat, interaction(column2, column3))), 
       function(x) if(NROW(x) > 1) {NULL} else {data.frame(x)}) 
do.call(rbind, out[!sapply(out, is.null)])

All of the above was done with:

dat <- structure(list(row.no = 1:5, column2 = structure(c(1L, 1L, 1L, 
1L, 1L), .Label = "bb", class = "factor"), column3 = structure(c(1L, 
1L, 1L, 2L, 3L), .Label = c("ee", "yy", "zz"), class = "factor"), 
    column4 = structure(c(2L, 1L, 2L, 1L, 2L), .Label = c("down", 
    "up"), class = "factor")), .Names = c("row.no", "column2", 
"column3", "column4"), class = "data.frame", row.names = c(NA, 
-5L))

2011-04-17 06:42:25

Thanks Gavin. When I entered the first line, I got the following error message: "Error in sort.list(y) : 'x' must be atomic for 'sort.list'. Have you called 'sort' on a list?" Would you mind teaching me how to solve this? – Catherine 2011-04-17 07:27:31

@sally I read in the data snippet you showed. It is in a data frame called 'dat', and the code to create 'dat' is now included in my answer. You didn't say how your data were stored, so I used the logical data structure (a data frame). – 2011-04-17 07:39:41

+1 for using base R – Andrie 2011-04-17 09:04:45

Gavin keeps raising the bar on the quality of answers. Here is my attempt.

# This is one way of importing the data into R 
sally <- textConnection("row.no column2 column3 column4 
1  bb   ee  up 
2  bb   ee  down 
3  bb   ee  up 
4  bb   yy  down 
5  bb   zz  up") 
sally <- read.table(sally, header = TRUE) 

# Order the data frame to make rle work its magic 
sally <- sally[order(sally$column3, sally$column4), ] 

# Find which values are repeating in the two key columns 
sally.rle2 <- rle(as.character(sally$column2)) 
sally.rle3 <- rle(as.character(sally$column3)) 

sally.can.wait2 <- sally.rle2$values[sally.rle2$lengths != 1] 
sally.can.wait3 <- sally.rle3$values[sally.rle3$lengths != 1] 

# Find which lines have repeating values in both column2 and column3 
# (%in% rather than == so that several repeating values are handled) 
dup <- intersect(which(sally$column2 %in% sally.can.wait2), 
                 which(sally$column3 %in% sally.can.wait3)) 

# Display the lines that are not flagged (guarding against an empty dup) 
if (length(dup) > 0) sally[-dup, ] else sally

2011-04-17 07:40:28

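+1 for using 'rle' – Andrie 2011-04-17 09:04:16

+1 Interesting use of 'rle()'. Couldn't you use 'lapply()' to arrange the 'rle()' calls? And likewise for the repeated code that follows? – 2011-04-17 09:38:38

@Gavin, true. Whenever you create several objects in a similar way, you can usually use the apply family of functions. – 2011-04-17 10:03:07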
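
A minimal sketch of what the lapply() suggestion in these comments might look like, assuming the same example data and the same sort order as the answer. The names key.cols, rles and repeating are hypothetical and do not appear in the original answer.

```r
# Example data, sorted the same way as in the answer
sally <- data.frame(
  column2 = c("bb", "bb", "bb", "bb", "bb"),
  column3 = c("ee", "ee", "ee", "yy", "zz"),
  column4 = c("up", "down", "up", "down", "up")
)
sally <- sally[order(sally$column3, sally$column4), ]

key.cols <- c("column2", "column3")

# One rle() call per column, collected in a named list
rles <- lapply(sally[key.cols], function(col) rle(as.character(col)))

# For each column, the values that occur in a run longer than one
repeating <- lapply(rles, function(r) r$values[r$lengths != 1])
repeating
```

Each list element is named after its column, so repeating$column3 plays the role of a separately named object per column.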
