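Here is an example that uses agrep on an added field ("match"), which is a concatenation of the fields you want to use for identifying duplicates (add other fields as needed). In this example, the list index corresponds to the row of the data.frame.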
# make a mock data.frame
df <- read.csv(textConnection("
id,FNAME,LNAME
1,Aaron,Golding
2,Aroon,Golding
3,Aaron,Golding
4,John,Bold
5,Markus,M.
6,John,Bald
"))
# string together the fields that might be matching and add to data.frame
df$match <- paste0(trimws(as.character(df$FNAME)),
                   trimws(as.character(df$LNAME)))
# make an empty list to fill in
possibleDups <- list()
# loop through each row and find matching strings
for (i in seq_len(nrow(df))) {
  # agrep() returns the indices of all approximate matches, including row i itself
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    possibleDups[[i]] <- dups[dups != i]
  } else {
    possibleDups[[i]] <- NA
  }
}
# proof - print the list of possible duplicates
print(possibleDups)
> [[1]]
> [1] 2 3
> [[2]]
> [1] 1 3
> [[3]]
> [1] 1 2
> [[4]]
> [1] 6
> [[5]]
> [1] NA
> [[6]]
> [1] 4
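How loosely agrep matches is controlled by its max.distance argument (the default is 0.1, a fraction of the pattern length). The following is only a small sketch of that effect; the keys vector is made up for illustration and is not part of the data above.

```r
# Hypothetical keys: an exact copy, a one-character variant, and an unrelated name
keys <- c("JohnBold", "JohnBald", "AaronGolding")

# max.distance = 0 permits no edits, so only the exact copy is returned
exact <- agrep("JohnBold", keys, max.distance = 0)

# max.distance = 1 permits one insertion, deletion, or substitution
fuzzy <- agrep("JohnBold", keys, max.distance = 1)

print(exact)  # 1
print(fuzzy)  # 1 2
```

Tightening or loosening max.distance is one way to trade missed duplicates against false positives; a suitable value depends on your data.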
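If you just want the duplicates listed as a string, you can use this loop instead of the previous one, and remove the line that creates the empty list.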
for (i in seq_len(nrow(df))) {
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    # collapse the matching row indices into one comma-separated string
    df$possibleDups[i] <- paste(dups[dups != i], collapse = ',')
  } else {
    df$possibleDups[i] <- NA
  }
}
print(df)
> id FNAME LNAME match possibleDups
> 1 1 Aaron Golding AaronGolding 2,3
> 2 2 Aroon Golding AroonGolding 1,3
> 3 3 Aaron Golding AaronGolding 1,2
> 4 4 John Bold JohnBold 6
> 5 5 Markus M. MarkusM. <NA>
> 6 6 John Bald JohnBald 4
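If you would rather keep the matches as integer vectors than as comma-separated strings, a list column is one option. This is a rough sketch only: it rebuilds the mock data so it runs on its own, uses base R's Map and setdiff in place of the loop, and leaves rows without a match as integer(0) rather than NA.

```r
# Rebuild the mock data so the sketch is self-contained
df <- data.frame(
  id    = 1:6,
  FNAME = c("Aaron", "Aroon", "Aaron", "John", "Markus", "John"),
  LNAME = c("Golding", "Golding", "Golding", "Bold", "M.", "Bald"),
  stringsAsFactors = FALSE
)
df$match <- paste0(trimws(df$FNAME), trimws(df$LNAME))

# For every key, collect the indices of all approximate matches (self included)
hits <- lapply(df$match, function(key) agrep(key, df$match))

# Drop each row's own index; Map() pairs hits[[i]] with i
df$possibleDups <- Map(setdiff, hits, seq_len(nrow(df)))

# Inspect the list column
str(df$possibleDups)
```

Each element can then be used directly as row indices, for example df[df$possibleDups[[1]], ] to pull the candidate duplicates of row 1.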
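I think the OP wants vectors as elements of the data.frame rather than comma-separated strings, so you could add `unlist(possibleDups, recursive = FALSE)` as a new column (untested) –
Also untested, but possibly a way to avoid the loop: `df$possible_duplicates <- Map(setdiff, lapply(df$match, agrep, df$match), 1:nrow(df))` –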
@Moody_Mudskipper is `Map` from the purrr library? – jdbcode