2017-03-03 88 views
-2

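R: return the row indices of duplicate sets

I'm curious how to find the row indices of sets of duplicate rows. Can this be done efficiently?

An example of what I'm looking for: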

df <- data.frame(state =c("MA", "MA", "MA", "NY", "CA", "CA", "CA"), 
       city = c("Boston", "Boston", "Lawrence", "New York", "San Francisco", "San Francisco", "Boston")) 
duplicate_sets(df, N=2) 
# Should return something like "found duplicates in rows (1, 2), (5, 6)" 
+1

Vectors have no concept of rows. Can you clarify your question and give a reproducible example? –

+0

Do you want to find duplicates across both columns, or in just one? –

+0

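Updated to a data frame in case the vector was distracting. Both columns. `duplicated(df)` only flags the duplicate rows; it doesn't break them down into duplicate sets. –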

Answers

2

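Up front: finding duplicates can be expensive. Below is the method that base::duplicated.data.frame uses to start the process: it converts the data.frame into a character vector with one element per row, then looks for duplicates in that vector. Unfortunately, duplicated only flags the second (and later) instances of a row, not the first, so it doesn't meet your needs on its own. I don't have the .Internal(duplicated(...)) code handy, so this is a close approximation.

Using table: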
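To illustrate the limitation described above, here is a minimal base R sketch. It relies on the `fromLast` argument of `duplicated`, which this answer does not otherwise use, so treat it as a side note; the variable names are made up for illustration.

```r
df <- data.frame(state = c("MA", "MA", "MA", "NY", "CA", "CA", "CA"),
                 city  = c("Boston", "Boston", "Lawrence", "New York",
                           "San Francisco", "San Francisco", "Boston"))

# duplicated() flags only the second and later copies of a row
later_copies <- which(duplicated(df))

# OR-ing with a scan from the end also flags the first copy of each set
all_copies <- which(duplicated(df) | duplicated(df, fromLast = TRUE))
```

This tells you which rows have a twin, but not which rows belong together, so a grouping step is still needed.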

df <- data.frame(state =c("MA", "MA", "MA", "NY", "CA", "CA", "CA"), 
       city = c("Boston", "Boston", "Lawrence", "New York", "San Francisco", "San Francisco", "Boston")) 

duplicate_sets <- function(df) { 
    # assuming a data.frame 
    xvec <- do.call("paste", c(df, sep = "\r")) 
    matches <- Filter(c, table(xvec) > 1) 
    lapply(names(matches), function(x) which(xvec == x)) 
} 

duplicate_sets(df) 
# [[1]] 
# [1] 5 6 
# [[2]] 
# [1] 1 2 

The result isn't guaranteed to be sorted, but sorting should be trivial enough for you to add yourself (if it even matters).
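If the order does matter, one possible way to get sorted output is sketched below. It swaps `table` for `split`, which is my substitution rather than part of the answer, and it assumes the same "\r"-separated key; `sets` is a hypothetical name.

```r
df <- data.frame(state = c("MA", "MA", "MA", "NY", "CA", "CA", "CA"),
                 city  = c("Boston", "Boston", "Lawrence", "New York",
                           "San Francisco", "San Francisco", "Boston"))

# Same key construction as in duplicate_sets(): one string per row
xvec <- do.call("paste", c(df, sep = "\r"))

# Group the row indices by key and keep only groups with more than one row
sets <- unname(split(seq_along(xvec), xvec))
sets <- sets[lengths(sets) > 1]

# Order the sets by their first (smallest) row index
sets <- sets[order(vapply(sets, min, integer(1)))]
```

Within each set the indices are already ascending, because `split` preserves the original row order.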

+0

Thanks, I updated the question to avoid the confusion over what a row means. –

0

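This is a bit hacky, and creating the keys may be inefficient depending on how many columns you have, but it's the kind of strategy I like to use for this type of analysis.

I find it gives good insight into what happens at each step, and it can easily be refactored for other purposes. It's also easy to wrap into a function if you want to.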

library(dplyr) 
library(data.table) 

df <- data.frame(state =c("MA", "MA", "MA", "NY", "CA", "CA", "CA"), 
       city = c("Boston", "Boston", "Lawrence", "New York", "San Francisco", "San Francisco", "Boston")) 



# create a key vector - potentially inefficient depending on your number of columns 
df_keys <- sapply(data.frame(t(df), stringsAsFactors = F), paste0, collapse='|') 
df$df_keys <- df_keys 


# capture original order for use later on 
df$original_order <- 1:nrow(df) 


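# find the duplicated keys and create one id per distinct key
# unique() keeps a key that occurs 3+ times from getting several ids;
# seq_along() avoids the 1:0 trap when there are no duplicates
df_key_dupes <- unique(df_keys[duplicated(df_keys)]) 
df_key_dupes_id <- seq_along(df_key_dupes) 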


df_dupe <- data.frame(df_keys = df_key_dupes, df_key_dupes_id, stringsAsFactors = F) 


# I use data tables for efficient merges, then back to df 
setDT(df); setDT(df_dupe) 
df <- merge(x=df, y=df_dupe, by='df_keys', all=T, sort=F) 
setDF(df) 

# remove NAs which indicate they aren't a dupe 
df2 <- df[!is.na(df$df_key_dupes_id),] 


# group by the dupe id and paste collapse the original_order field with a comma and space 
df2 <- group_by(df2, df_key_dupes_id) %>% 
    summarise(dupe_set=paste0(original_order, collapse=', ')) 


# print out according to request (sorry, I'm a painfully literal human) 
cat("duplicates in rows: ", paste0('(', df2$dupe_set, ') '))
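
As a rough sketch of the "wrap it into a function" idea, the same key-based strategy can be written in base R alone. The function name `report_duplicate_sets` and the "no duplicates found" message are hypothetical, and the "|" separator assumes that no value contains that character.

```r
# Hypothetical helper: build one key per row, then report the rows that share a key
report_duplicate_sets <- function(df) {
  # One key per row; assumes "|" never appears in the data itself
  keys <- do.call(paste, c(df, sep = "|"))

  # Distinct keys that occur more than once
  dupe_keys <- unique(keys[duplicated(keys)])

  # For each duplicated key, collapse the matching row numbers into "i, j, ..."
  dupe_sets <- vapply(dupe_keys,
                      function(k) paste(which(keys == k), collapse = ", "),
                      character(1), USE.NAMES = FALSE)

  if (length(dupe_sets) == 0) return("no duplicates found")
  paste0("duplicates in rows: ", paste0("(", dupe_sets, ")", collapse = " "))
}

df <- data.frame(state = c("MA", "MA", "MA", "NY", "CA", "CA", "CA"),
                 city  = c("Boston", "Boston", "Lawrence", "New York",
                           "San Francisco", "San Francisco", "Boston"))

msg <- report_duplicate_sets(df)
```

Returning the message instead of printing it keeps the function easy to reuse; call `cat(msg, "\n")` to print it.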