Checking for duplicate IDs

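I have data with a list of people's names and their ID numbers. Some people are listed two or three times. Each person has an ID number; if someone is listed more than once, their ID number stays the same as long as it is the same person. Like this: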

Name     ID
david    1
david    1
john     2
john     2
john     2
john     2
megan    3
bill     4
barbara  5
chris    6
chris    6

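I need to make sure these ID numbers are correct and that different people do not share the same ID number. To do that, I would like to create a new variable that assigns new ID numbers, so that I can compare the new ID numbers with the old ones. I want a command that says "if their names are the same, make their ID numbers the same". How can I do this? Does this make sense?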


2017-08-10 Rachel

Take the unique names, add an ID, then merge it back – Wen

Won't I be unable to use unique(Name) with the original dataset, since the lengths will differ after merging? – Rachel

You will be able to merge. Merge is a lookup function based on common values. It is similar to DLookup in Access, or VLOOKUP and HLOOKUP in Excel or Calc. –
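
The unique-then-merge idea from the comments could look roughly like the base R sketch below. It uses the example data from the question; the names NewID, lookup, checked and id_pairs are made up for illustration. It also shows why the length difference is not a problem: merge() repeats the lookup row for every matching row of the original data.

```r
# Example data from the question (names and their recorded IDs)
dat <- data.frame(
  Name = c("david", "david", "john", "john", "john", "john",
           "megan", "bill", "barbara", "chris", "chris"),
  ID   = c(1, 1, 2, 2, 2, 2, 3, 4, 5, 6, 6),
  stringsAsFactors = FALSE
)

# Lookup table: one row per distinct name, with a freshly assigned ID.
# NewID is a hypothetical column name, numbered in order of first appearance.
lookup <- data.frame(Name = unique(dat$Name), stringsAsFactors = FALSE)
lookup$NewID <- seq_len(nrow(lookup))

# merge() matches on Name, so the shorter lookup table is expanded
# back to one row per row of the original data.
checked <- merge(dat, lookup, by = "Name")

# Distinct (old ID, new ID) combinations that occur in the data
id_pairs <- unique(checked[, c("ID", "NewID")])

# TRUE if some old ID is attached to more than one name
any(duplicated(id_pairs$ID))

# TRUE if some name carries more than one old ID
any(duplicated(id_pairs$NewID))
```

Comparing the distinct ID/NewID pairs, rather than testing ID == NewID row by row, avoids false alarms when the old IDs were not handed out in order of first appearance. Note that this treats identical names as the same person, which is the assumption stated in the question.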

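There are many ways to do this, some of which are suggested above. I usually use dplyr to spot and remove duplicate/bad cases. Depending on your goal, here are examples of various outputs.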

library(dplyr) 

# example data with inconsistent name/ID pairs (e.g. John has two IDs)
dt <- data.frame(Name = c("david","davud","John","John","megan"), 
       ID = c(1,1,2,3,3), stringsAsFactors = FALSE) 


# spot names with more than 1 unique IDs 
dt %>% 
    group_by(Name) %>% 
    summarise(NumIDs = n_distinct(ID)) %>% 
    filter(NumIDs > 1) 

# # A tibble: 1 x 2 
#   Name  NumIDs 
#   <chr>  <int> 
# 1 John       2 


# spot names with more than 1 unique IDs and the actual IDs 
dt %>% 
    group_by(Name) %>% 
    mutate(NumIDs = n_distinct(ID)) %>% 
    filter(NumIDs > 1) %>% 
    ungroup() 

# # A tibble: 2 x 3 
#   Name     ID NumIDs 
#   <chr> <dbl>  <int> 
# 1 John      2      2 
# 2 John      3      2 


# spot names with more than 1 unique IDs and the actual IDs - alternative 
dt %>% 
    group_by(Name) %>% 
    mutate(NumIDs = n_distinct(ID)) %>% 
    filter(NumIDs > 1) %>% 
    group_by(Name, NumIDs) %>% 
    summarise(IDs = paste0(ID, collapse=",")) %>% 
    ungroup() 

# # A tibble: 1 x 3 
#   Name  NumIDs IDs 
#   <chr>  <int> <chr> 
# 1 John       2 2,3
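
The checks above look for names with more than one ID. The question also asks about the opposite case, different people sharing one ID. A possible sketch of that check, mirroring the same dplyr pattern, is below; the column names NumNames and Names and the object name shared_ids are made up for illustration.

```r
library(dplyr)

# same example data as above
dt <- data.frame(Name = c("david", "davud", "John", "John", "megan"),
                 ID = c(1, 1, 2, 3, 3), stringsAsFactors = FALSE)

# spot IDs attached to more than 1 unique name, and list those names
shared_ids <- dt %>%
  group_by(ID) %>%
  summarise(NumNames = n_distinct(Name),
            Names = paste0(unique(Name), collapse = ",")) %>%
  filter(NumNames > 1)

shared_ids
```

With the example data this flags ID 1 (david, davud) and ID 3 (John, megan). Whether a flagged pair is a typo in the name or a wrongly assigned ID still has to be decided by looking at the rows.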


2017-08-11 12:08:39 AntoniosK
