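Conditional keyed join/update _and_ update a flag column for matches
This is very similar to the question @DavidArenburg asked about conditional keyed joins, with an additional bugbear that I can't seem to figure out.
Basically, in addition to a conditional join, I want to define a flag saying at which step of the matching process the match occurred; my problem is that I can only get the flag to be set for all values, not just the matched values.
Here's what I hope is a minimal working example: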
library(data.table)

DT = data.table(
  name = c("Joe", "Joe", "Jim", "Carol", "Joe",
           "Carol", "Ann", "Ann", "Beth", "Joe", "Joe"),
  surname = c("Smith", "Smith", "Jones",
              "Clymer", "Smith", "Klein", "Cotter",
              "Cotter", "Brown", "Smith", "Smith"),
  maiden_name = c("", "", "", "", "", "Clymer",
                  "", "", "", "", ""),
  id = c(1, 1:3, rep(NA, 7)),
  year = rep(1:4, c(4, 3, 2, 2)),
  # flags start as FALSE, matching the printed table below
  flag1 = FALSE, flag2 = FALSE, key = "year"
)
DT
#      name surname maiden_name id year flag1 flag2
#  1:   Joe   Smith              1    1 FALSE FALSE
#  2:   Joe   Smith              1    1 FALSE FALSE
#  3:   Jim   Jones              2    1 FALSE FALSE
#  4: Carol  Clymer              3    1 FALSE FALSE
#  5:   Joe   Smith             NA    2 FALSE FALSE
#  6: Carol   Klein      Clymer NA    2 FALSE FALSE
#  7:   Ann  Cotter             NA    2 FALSE FALSE
#  8:   Ann  Cotter             NA    3 FALSE FALSE
#  9:  Beth   Brown             NA    3 FALSE FALSE
# 10:   Joe   Smith             NA    4 FALSE FALSE
# 11:   Joe   Smith             NA    4 FALSE FALSE
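My approach is, for each year, to first try to match on first name/surname from a prior year; if that fails, then try to match on first name/maiden name. I want to define flag1 to denote an exact match and flag2 to denote a marriage.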
for (yr in 2:4) {
  #which ids have we hit so far?
  existing_ids = DT[.(yr), unique(id)]
  #find people in prior years appearing to
  #  correspond to those people
  unmatched =
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)
  #merge a la Arun, define flag1
  setkey(DT, name, surname)
  DT[year == yr, c("id", "flag1") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)
  #repeat, this time keying on name/maiden_name
  existing_ids = DT[.(yr), unique(id)]
  unmatched =
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)
  #now define flag2 = TRUE
  setkey(DT, name, maiden_name)
  DT[year == yr & is.na(id), c("id", "flag2") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)
  #this is messy, but I'm trying to increment id
  #  for "new" individuals
  setkey(DT, name, surname, maiden_name)
  DT[year == yr & is.na(id),
     id := unique(
       DT[year == yr & is.na(id)],
       by = c("name", "surname", "maiden_name")
     )[ , count := .I][.SD, count] + DT[ , max(id, na.rm = TRUE)]
  ]
  #re-sort by year at the end
  setkey(DT, year)
}
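As an aside on the "messy" last step: numbering previously unseen individuals might be done more compactly with `.GRP`, which numbers groups in order of first appearance. This is only a sketch on a hypothetical toy table `people` (not the `DT` above), assuming a new individual is identified by name and surname alone; `max_id` is a helper name introduced here for illustration.

```r
library(data.table)

# Hypothetical toy data: Joe already has an id, the others are new
people = data.table(
  name    = c("Joe", "Ann", "Ann", "Beth"),
  surname = c("Smith", "Cotter", "Cotter", "Brown"),
  id      = c(1L, NA, NA, NA)
)

# Largest id assigned so far (computed before subsetting to the NA rows)
max_id = people[, max(id, na.rm = TRUE)]

# Each distinct name/surname pair among the id-less rows gets the next id;
# .GRP is the group counter (1, 2, ...) in order of first appearance
people[is.na(id), id := .GRP + max_id, by = .(name, surname)]
```

Both rows for Ann share one new id this way, which is the behaviour the loop above seems to be after.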
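I was hoping that by including the TRUE value in the j argument while I define id, only the matched names (e.g., Joe at the first step) would have their flag updated to TRUE, but this isn't the case; they're all updated: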
DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2  TRUE  TRUE
#  6: Carol   Klein      Clymer  3    2  TRUE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3  TRUE  TRUE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
Is there any way to update only the matched rows' flag values? The ideal output is as follows:
DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2 FALSE FALSE
#  6: Carol   Klein      Clymer  3    2 FALSE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3 FALSE FALSE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
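One possible way to touch only the matched rows is an update join: with `X[Y, on = ..., ... := ...]`, only the rows of `X` that have a match in `Y` are modified, so a constant `TRUE` lands on matches alone. The following is a sketch on hypothetical toy tables (`prior`, `current`), assuming a data.table version that supports `on=`; it is not a drop-in replacement for the loop above.

```r
library(data.table)

# Hypothetical toy tables: Joe is known from before, Ann is new
prior   = data.table(name = c("Joe", "Jim"), id = 1:2)
current = data.table(name = c("Joe", "Ann"), id = NA_integer_, flag1 = FALSE)

# Update join: rows of `current` matching `prior` on name receive the prior
# id (i.id refers to the id column of `prior`) and have flag1 set to TRUE;
# unmatched rows such as Ann are left untouched.
current[prior, on = "name", `:=`(id = i.id, flag1 = TRUE)]
```

After this, Joe has `id` 1 and `flag1` TRUE, while Ann keeps `NA` and FALSE. The same pattern could presumably be repeated with a different `on=` for the maiden-name step.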
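So basically, we're merging the result of the merge I'm doing back into the original table? – MichaelChirico
@MichaelChirico I've updated my answer. This is probably what I'd do. I guess there's no need to mention years. – Frank
I'm afraid I made the mistake of oversimplifying my working example orz. Working on something more accurate to what I actually have going on now – MichaelChirico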