Find the indices of rows with near-duplicate values

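I am trying to find near-duplicate rows in a dataset. For my data, I need to add a `POSSIBLE_DUPLICATES` column that contains the indices of the possible duplicates. Besides the fields FNAME and LNAME, the data contains other information that could also be used to find duplicates.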

| id | FNAME | LNAME | POSSIBLE_DUPLICATES | 
|----|--------|---------|---------------------| 
| 1 | Aaron | Golding | 2,3     | 
| 2 | Aroon | Golding | 1,3     | 
| 3 | Aaron | Golding | 2,1     | 
| 4 | John | Bold | 6     | 
| 5 | Markus | M.  |      | 
| 6 | John | Bald | 4     |

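I tried to find the indices with the `agrep()` function, but I don't quite understand how to call it on multiple columns, or how to concatenate the indices for all rows. Any help would be appreciated.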


2017-08-02 Евгений М

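Here is an example that uses `agrep` on an added field (`match`), which is a concatenation of the fields chosen for identifying duplicates (add other fields as needed). In this example, the list index corresponds to the row of the data.frame.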

# make a mock data.frame
df <- read.csv(textConnection("
id,FNAME,LNAME
1,Aaron,Golding
2,Aroon,Golding
3,Aaron,Golding
4,John,Bold
5,Markus,M.
6,John,Bald
"))

# string together the fields that might be matching and add to data.frame
df$match <- paste0(trimws(as.character(df$FNAME)),
                   trimws(as.character(df$LNAME)))

# make an empty list to fill in
possibleDups <- list()

# loop through each row and find matching strings
for (i in seq_len(nrow(df))) {
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    # drop the row's own index, keep the other matches
    possibleDups[[i]] <- dups[dups != i]
  } else {
    possibleDups[[i]] <- NA
  }
}

# proof - print the list of possible duplicates
print(possibleDups)

> [[1]] 
> [1] 2 3 

> [[2]] 
> [1] 1 3 

> [[3]] 
> [1] 1 2 

> [[4]] 
> [1] 6 

> [[5]] 
> [1] NA 

> [[6]] 
> [1] 4
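
To bring more columns into the comparison, one option is to build the key from a vector of column names rather than pasting fields by hand. This is a minimal sketch, assuming a hypothetical extra column `CITY` that is not part of the original data; `key_cols` is likewise just an illustrative name.

```r
# Mock data; CITY is a hypothetical extra field used only for illustration
df <- data.frame(
  id    = 1:4,
  FNAME = c("Aaron", "Aroon", "John", "John"),
  LNAME = c("Golding", "Golding", "Bold", "Bald"),
  CITY  = c("Leeds", "Leeds", "York", "York"),
  stringsAsFactors = FALSE
)

# Columns that should take part in the near-duplicate comparison
key_cols <- c("FNAME", "LNAME", "CITY")

# Trim every chosen column, then paste them together row by row.
# unname() keeps column names from being read as paste0() arguments.
trimmed  <- lapply(df[key_cols], function(x) trimws(as.character(x)))
df$match <- do.call(paste0, unname(trimmed))

print(df$match)
```

The resulting `match` column can then be passed to `agrep` exactly as in the loop above. Longer keys raise the number of edits the default tolerance allows, so it may be worth checking the matches by eye.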

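If you just want a string listing the duplicates, you can use this loop instead of the previous one, and delete the line that creates the empty list.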

for (i in seq_len(nrow(df))) {
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    # collapse the other matching row indices into one comma-separated string
    df$possibleDups[i] <- paste(dups[dups != i], collapse = ',')
  } else {
    df$possibleDups[i] <- NA
  }
}

print(df)

>   id  FNAME   LNAME        match possibleDups
> 1  1  Aaron Golding AaronGolding          2,3
> 2  2  Aroon Golding AroonGolding          1,3
> 3  3  Aaron Golding AaronGolding          1,2
> 4  4   John    Bold     JohnBold            6
> 5  5 Markus      M.     MarkusM.         <NA>
> 6  6   John    Bald     JohnBald            4
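
How loosely `agrep` matches is controlled by its `max.distance` argument, which defaults to 0.1 (a fraction of the pattern length, rounded up to a whole number of edits). The sketch below uses made-up keys to show how the tolerance changes which entries are flagged; `JonBald` is a hypothetical value, not taken from the data above.

```r
# Illustrative keys; "JonBald" is a made-up third variant
keys <- c("JohnBold", "JohnBald", "JonBald")

# No edits allowed: only the exact key matches
strict <- agrep(keys[1], keys, max.distance = 0)

# Default tolerance: 0.1 * 8 characters, rounded up to 1 edit
default <- agrep(keys[1], keys)

# Up to 2 edits allowed: the third variant is picked up as well
loose <- agrep(keys[1], keys, max.distance = 2)

print(list(strict = strict, default = default, loose = loose))
```

A looser tolerance catches more typos but also raises the risk of flagging distinct people, so the right value probably depends on the data.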


2017-08-02 23:32:36 jdbcode

I think the OP wants vectors as elements of the data.frame rather than comma-separated strings, so you could add `unlist(possibleDups, recursive = FALSE)` as a new column (untested)

Also untested, possibly a way to avoid the loop: `df$possible_duplicates <- Map(setdiff, lapply(df$match, agrep, df$match), 1:nrow(df))`

@Moody_Mudskipper is `Map` from the purrr library? – jdbcode
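
The loop-free suggestion from the comments can be sketched as follows. `Map` here is the base R function, so no extra package is needed. The column name `POSSIBLE_DUPLICATES` follows the question, while `matches` and `POSSIBLE_DUPLICATES_STR` are illustrative names; treat this as one possible arrangement rather than the commenter's exact code.

```r
# Rebuild the mock data from the answer
df <- data.frame(
  id    = 1:6,
  FNAME = c("Aaron", "Aroon", "Aaron", "John", "Markus", "John"),
  LNAME = c("Golding", "Golding", "Golding", "Bold", "M.", "Bald"),
  stringsAsFactors = FALSE
)
df$match <- paste0(trimws(df$FNAME), trimws(df$LNAME))

# For every key, the indices of all approximately matching keys (including itself)
matches <- lapply(df$match, agrep, x = df$match)

# Remove each row's own index; the result is stored as a list column
df$POSSIBLE_DUPLICATES <- Map(setdiff, matches, seq_len(nrow(df)))

# Optional: a comma-separated string version, empty where nothing matched
df$POSSIBLE_DUPLICATES_STR <- vapply(
  df$POSSIBLE_DUPLICATES, paste, character(1), collapse = ","
)

print(df[c("id", "FNAME", "LNAME", "POSSIBLE_DUPLICATES_STR")])
```

Rows without any match end up with a zero-length vector in the list column (and an empty string in the text column) instead of `NA`.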
