2015-07-09 63 views
3

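Conditional keyed join/update _and_ update a flag column for matches

This is very similar to the question @DavidArenburg asked about conditional keyed joins, with an additional bugbear that I can't seem to figure out.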

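Basically, in addition to the conditional join, I want to define a flag saying at which step of the matching process the match occurred. My problem is that I can only get the flag set for all values, not just the matched ones.

Here is what I hope is a minimal working example: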

library(data.table)

DT = data.table(
    name = c("Joe", "Joe", "Jim", "Carol", "Joe",
             "Carol", "Ann", "Ann", "Beth", "Joe", "Joe"),
    surname = c("Smith", "Smith", "Jones",
                "Clymer", "Smith", "Klein", "Cotter",
                "Cotter", "Brown", "Smith", "Smith"),
    maiden_name = c("", "", "", "", "", "Clymer",
                    "", "", "", "", ""),
    id = c(1, 1:3, rep(NA, 7)),
    year = rep(1:4, c(4, 3, 2, 2)),
    # flags start as FALSE, matching the printed table below
    flag1 = FALSE, flag2 = FALSE, key = "year"
)

DT
#      name surname maiden_name id year flag1 flag2
#  1:   Joe   Smith              1    1 FALSE FALSE
#  2:   Joe   Smith              1    1 FALSE FALSE
#  3:   Jim   Jones              2    1 FALSE FALSE
#  4: Carol  Clymer              3    1 FALSE FALSE
#  5:   Joe   Smith             NA    2 FALSE FALSE
#  6: Carol   Klein      Clymer NA    2 FALSE FALSE
#  7:   Ann  Cotter             NA    2 FALSE FALSE
#  8:   Ann  Cotter             NA    3 FALSE FALSE
#  9:  Beth   Brown             NA    3 FALSE FALSE
# 10:   Joe   Smith             NA    4 FALSE FALSE
# 11:   Joe   Smith             NA    4 FALSE FALSE

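My approach is, for each year, to first try to match on name/surname from prior years; if that fails, try to match on name/maiden_name. I want flag1 to denote an exact match and flag2 to denote a marriage.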

for (yr in 2:4) {

    # which ids have we hit so far?
    existing_ids = DT[.(yr), unique(id)]

    # find people in prior years appearing to
    #   correspond to those people
    unmatched =
        DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
    setkey(unmatched, name, surname)

    # merge a la Arun, define flag1
    setkey(DT, name, surname)
    DT[year == yr, c("id", "flag1") := unmatched[.SD, .(id, TRUE)]]
    setkey(DT, year)

    # repeat, this time keying on name/maiden_name
    existing_ids = DT[.(yr), unique(id)]
    unmatched =
        DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
    setkey(unmatched, name, surname)

    # now define flag2 = TRUE
    setkey(DT, name, maiden_name)
    DT[year == yr & is.na(id), c("id", "flag2") := unmatched[.SD, .(id, TRUE)]]
    setkey(DT, year)

    # this is messy, but I'm trying to increment id
    #   for "new" individuals
    setkey(DT, name, surname, maiden_name)
    DT[year == yr & is.na(id),
       id := unique(
           DT[year == yr & is.na(id)],
           by = c("name", "surname", "maiden_name")
       )[ , count := .I][.SD, count] + DT[ , max(id, na.rm = TRUE)]
    ]

    # re-sort by year at the end
    setkey(DT, year)
}

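I was hoping that by passing TRUE in the j argument while defining id, only the rows with a matched name (e.g., Joe in the first step) would have their flag updated to TRUE, but this isn't the case. They are all updated: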

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2  TRUE  TRUE
#  6: Carol   Klein      Clymer  3    2  TRUE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3  TRUE  TRUE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE

Is there any way to update the flag values of only the matched rows? The ideal output is as follows:

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2 FALSE FALSE
#  6: Carol   Klein      Clymer  3    2 FALSE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3 FALSE FALSE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE

Answers

0

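The key (no pun intended) here, I think, is to realize that the merge returns NA for the ids it misses, so I should add the flag to unmatched at each step, e.g., in step 1: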

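# dt is the question's table (named DT there);
# yr and existing_ids come from the question's loop
unmatched <- dt[.(1:(yr - 1L))
                ][!id %in% existing_ids,
                  .SD[.N], by = id][ , flag1 := TRUE]
# flag1 is now a column of unmatched, so it is NA wherever the join misses
dt[year == yr, c("id", "flag1") :=
       unmatched[.SD, .(id, flag1), on = c("name", "surname")]]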

Ultimately, this yields:

> dt[ ]
     name surname maiden_name id year flag1 flag2
 1: Carol  Clymer              3    1 FALSE FALSE
 2:   Jim   Jones              2    1 FALSE FALSE
 3:   Joe   Smith              1    1 FALSE FALSE
 4:   Joe   Smith              1    1 FALSE FALSE
 5:   Ann  Cotter              4    2    NA    NA
 6: Carol   Klein      Clymer  3    2    NA  TRUE
 7:   Joe   Smith              1    2  TRUE FALSE
 8:   Ann  Cotter              4    3  TRUE FALSE
 9:  Beth   Brown              5    3    NA    NA
10:   Joe   Smith              1    4  TRUE FALSE
11:   Joe   Smith              1    4  TRUE FALSE

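One remaining issue is that some flags which should be F have been reset to NA. It would be nice to be able to set nomatch=F, but I'm not too worried about this side effect; the key for me is knowing when each flag is T.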
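To make the mechanism concrete, here is a minimal sketch on toy data of why a constant TRUE in j flags every row, while a flag stored as a column of the lookup table comes back NA for misses, and of one way to turn those NAs back into FALSE afterwards. The tables `current` and `prior` and the column `flag` are invented for illustration and are not part of the question's data.

```r
library(data.table)

# Toy data (hypothetical): Joe has a prior record, Ann does not
current <- data.table(name = c("Joe", "Ann"), surname = c("Smith", "Cotter"))
prior   <- data.table(name = "Joe", surname = "Smith", id = 1L)

# A constant in j is recycled to every row of the join result,
# whether or not that row found a match
res_const <- prior[current, .(id, flag = TRUE), on = c("name", "surname")]

# A flag stored as a column of the lookup table is NA wherever the join misses
prior[, flag := TRUE]
res_col <- prior[current, .(id, flag), on = c("name", "surname")]

# Reset the misses to FALSE after the join
res_col[is.na(flag), flag := FALSE]
```

In `res_const` both rows carry `flag = TRUE` even though Ann's `id` is NA; in `res_col` only Joe's row is flagged. The same NA-to-FALSE reset could be applied to `flag1` and `flag2` at the end of each step, assuming no flag is meant to stay NA.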

3

I think the flags are confusing here; it is simpler to identify where each id came from:

# dt is the question's table (named DT there)
dt[, c("flag1", "flag2") := NULL]

# create name -> id table
namemap <- unique(dt[, .(maiden_name, id, year), keyby = .(name, surname)], by = NULL)

# tag original ids
namemap[!is.na(id), src := "original"]

# carried over from earlier years
namemap[, has_oid := any(!is.na(id)), by = key(namemap)]
namemap[(has_oid), `:=`(
    id  = id[!is.na(id)],
    src = ifelse(is.na(id), "history", src)
), by = .(name, surname)]

# carry over for surname changes on marriage
namemap[maiden_name != "", `:=`(
    id  = namemap[.BY]$id,
    src = "maiden"
), by = .(name, maiden_name)]

# create new ids where none exists
namemap[is.na(id), `:=`(
    id  = .GRP + max(dt$id, na.rm = TRUE),
    src = "new"
), by = .(name, surname)]

# copy back to the original table
setkey(dt, name, surname, year)
setkey(namemap, name, surname, year)
dt[namemap, `:=`(
    id  = i.id,
    src = i.src   # i. prefix: take src from namemap explicitly
)]

This gives:

     name surname maiden_name id year      src
 1:   Ann  Cotter              4    2      new
 2:   Ann  Cotter              4    3      new
 3:  Beth   Brown              5    3      new
 4: Carol  Clymer              3    1 original
 5: Carol   Klein      Clymer  3    2   maiden
 6:   Jim   Jones              2    1 original
 7:   Joe   Smith              1    1 original
 8:   Joe   Smith              1    1 original
 9:   Joe   Smith              1    2  history
10:   Joe   Smith              1    4  history
11:   Joe   Smith              1    4  history

The original order of the data is lost, but it is easy to recover if you want it.
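One way to recover the order, sketched here on a toy table, is to record the row position before keying and sort on it afterwards. The column name `row_id` is an assumed helper, not something from the answer's code.

```r
library(data.table)

# Toy table (hypothetical) standing in for dt
dt <- data.table(name = c("Joe", "Carol", "Ann"), year = c(2L, 1L, 1L))

dt[, row_id := .I]        # remember each row's original position
setkey(dt, name, year)    # keying physically re-sorts the rows
# ... keyed joins and updates would go here ...
setorder(dt, row_id)      # sort back into the original order
dt[, row_id := NULL]      # drop the helper column
```

Since `setkey` reorders the table by reference, the helper column has to be added before the first `setkey` call for this to work.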

+0

So basically, we merge the result of the merges I'm doing back into the original table? – MichaelChirico

+0

@MichaelChirico I've updated my answer. This is probably what I would do. I figured there was no need to mention the years. – Frank

+0

I'm afraid I over-simplified my working example orz. I'm now working on something closer to what I'm actually dealing with. – MichaelChirico