2016-08-22
3

R - Merge and update main dataset

I have two datasets representing old and current addresses.

> main 
idspace id x y move 
    198 1238 33 4 stay 
    641 1236 36 12 move 
    1515 1237 30 28 move 

> move 
idspace id x y move 
     4 1236 4 1 move 

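What I need is to merge the new data (move) into the old data (main) and update main as part of the same merge.

I would like to know whether this can be done in a single operation.

The update is keyed on id, which is the personal identifier.

idspace, x, and y are the location identifiers.

So the output I need is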

> main 
    idspace id x y move 
     198 1238 33 4 stay 
     4 1236 4 1 move # this one is updated 
     1515 1237 30 28 move 

I am not sure how to do this.

Something like

merge(main, move, by = c('id'), all = T, suffixes = c('old', 'new')) 

However, this is not what I want, because I would then have to do many manual operations on the result.

Any solutions?

Data

> dput(main) 
structure(list(idspace = structure(c(2L, 3L, 1L), .Label = c("1515", 
"198", "641"), class = "factor"), id = structure(c(3L, 1L, 2L 
), .Label = c("1236", "1237", "1238"), class = "factor"), x = structure(c(2L, 
3L, 1L), .Label = c("30", "33", "36"), class = "factor"), y = structure(c(3L, 
1L, 2L), .Label = c("12", "28", "4"), class = "factor"), move =  structure(c(2L, 
1L, 1L), .Label = c("move", "stay"), class = "factor")), .Names = c("idspace", 
"id", "x", "y", "move"), row.names = c(NA, -3L), class = "data.frame") 

> dput(move) 
structure(list(idspace = structure(1L, .Label = "4", class = "factor"), 
id = structure(1L, .Label = "1236", class = "factor"), x = structure(1L, .Label = "4", class = "factor"), 
    y = structure(1L, .Label = "1", class = "factor"), move = structure(1L, .Label = "move", class = "factor")), .Names = c("idspace", 
"id", "x", "y", "move"), row.names = c(NA, -1L), class = "data.frame") 
+1

I think this is a dup, as the 'tmp <- rbind(move, main); tmp[!duplicated(tmp$id), ]' logic works just fine, assuming there are no other requirements here. – thelatemail
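
The rbind + duplicated idea from the comment above can be sketched as follows. This is only an illustration under some assumptions: the data are rebuilt by hand as character columns rather than the factors from the dput output, and the name 'updated' is introduced here. Note that the updated row ends up first, so the row order of main is not preserved.

```r
# Toy copies of the question's data, stored as character columns for simplicity
main <- data.frame(idspace = c("198", "641", "1515"),
                   id      = c("1238", "1236", "1237"),
                   x       = c("33", "36", "30"),
                   y       = c("4", "12", "28"),
                   move    = c("stay", "move", "move"),
                   stringsAsFactors = FALSE)
move <- data.frame(idspace = "4", id = "1236", x = "4", y = "1",
                   move = "move", stringsAsFactors = FALSE)

# Put the new rows first: duplicated() then flags the OLD row for any id
# that also appears in 'move', and those flagged rows are dropped
tmp <- rbind(move, main)
updated <- tmp[!duplicated(tmp$id), ]
updated
```

If the original row order matters, the result could be reordered afterwards, for example with 'updated[match(main$id, updated$id), ]'.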

+0

@thelatemail I was thinking of using 'sqldf', but I don't know the API well enough to answer. –

+1

@TimBiegeleisen - maybe 'sqldf("select coalesce(b.idspace, a.idspace) as idspace, coalesce(b.id, a.id) as id, coalesce(b.x, a.x) as x, coalesce(b.y, a.y) as y, coalesce(b.move, a.move) as move from main a left join move b on a.id = b.id")' - ugly, but it does work. – thelatemail

Answers

9

Using the join + update feature of data.table:

require(data.table) # v1.9.6+ 
setDT(main) # convert data.frames to data.tables by reference 
setDT(move) 

main[move, on=c("id", "move"), # extract the row number in 'main' where 'move' matches 
     c("idspace", "x", "y") := .(i.idspace, i.x, i.y)] # update cols of 'main' with 
                 # values from 'i' = 'move' for 
                 # those matching rows 


main 
# idspace id x y move 
# 1:  198 1238 33 4 stay 
# 2:  4 1236 4 1 move 
# 3: 1515 1237 30 28 move 

This updates main in place.

+1

This is great! Any chance of a 'dplyr' routine? – giacomo

+2

Asking the main data.table developer for a dplyr solution... hmm... – nrussell

+0

OK, OK, sorry ;) - still a great solution! – giacomo

1

Here is a dplyr solution:

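# If you want both old and new 
dplyr::full_join(main, move) 

# If you want both old and new, labelled in a 'suffix' column 
# (work on copies so that 'main' and 'move' themselves stay unchanged) 
main_old <- transform(main, suffix = "old") 
move_new <- transform(move, suffix = "new") 
dplyr::full_join(main_old, move_new) 

# If you want new only 
# Factor columns cannot take unseen levels on assignment, so convert to character first 
main[] <- lapply(main, as.character) 
move[] <- lapply(move, as.character) 
new <- dplyr::left_join(main, move, by = "id") # could also use %>%; assumes ids in 'move' are unique 
hit <- !is.na(new$move.y) # rows of 'main' that have a match in 'move' 
main[hit, c("idspace", "x", "y")] <- new[hit, c("idspace.y", "x.y", "y.y")] 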
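
As a possible alternative for the "new only" case, more recent dplyr releases (1.0.0 and later, which postdate this thread) provide 'rows_update()'. The sketch below assumes such a version is installed and that the columns are stored as character rather than as the factors from the dput output; the name 'updated' is introduced here for illustration.

```r
library(dplyr)  # rows_update() is assumed to be available (dplyr >= 1.0.0)

# Toy copies of the question's data as character columns (an assumption;
# the original dput data uses factors with differing levels)
main <- data.frame(idspace = c("198", "641", "1515"),
                   id      = c("1238", "1236", "1237"),
                   x       = c("33", "36", "30"),
                   y       = c("4", "12", "28"),
                   move    = c("stay", "move", "move"),
                   stringsAsFactors = FALSE)
move <- data.frame(idspace = "4", id = "1236", x = "4", y = "1",
                   move = "move", stringsAsFactors = FALSE)

# Overwrite the rows of 'main' whose id appears in 'move';
# every id in 'move' must be unique and already present in 'main'
updated <- rows_update(main, move, by = "id")
updated
```

Unlike the data.table join, this returns a new data frame and leaves main itself untouched, so the result has to be assigned back if an in-place update is wanted.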

1

I think I found a very simple way to solve this:

main = as.matrix(main) 
move = as.matrix(move) 

main[main[,'id'] %in% move[,'id'], ] <- move 

It matches on id, keeps id in order, and changes only the matching rows. It seems to work on the whole dataset.

+0

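Note that in this case there is no way of knowing which 'main$id' matches which 'move$id'. You are assuming that the matches come in the same order as the rows in 'move'. – Arun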

+0

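@Arun you are absolutely right. However, it seems to work. I also tried 'main[main[,"id"] %in% move[,"id"], c("idspace","x","y","move")] <- move[which(move[,"id"] %in% main[,"id"]), c("idspace","x","y","move")]', which also updates. In the latter case, the "id" is matched. Thanks again for your patience and attention! – giacomo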

+1

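'%in%' returns a logical vector. Subsetting with it always preserves the order of the input data. Try a more complex example. For example, if items 1 and 3 of 'main$id' match items 3 and 1 of 'move$id', then rows 1 and 3 of 'move' are assigned to rows 1 and 3 of 'main'. That is wrong. – Arun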
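
The point made in the comment above can be checked with a small example. The data below are hypothetical (not from the question) and are chosen so that the ids in 'move' appear in a different order than in 'main'; 'wrong', 'right' and 'pos' are names introduced for this sketch. It contrasts the '%in%' assignment with one based on 'match()', which returns the row position in main for each id in move.

```r
# Hypothetical data: ids in 'move' are in a different order than in 'main'
main <- data.frame(id = c("a", "b", "c"), x = c(1, 2, 3),
                   stringsAsFactors = FALSE)
move <- data.frame(id = c("c", "a"), x = c(30, 10),
                   stringsAsFactors = FALSE)

# %in% only marks WHICH rows of 'main' have a match; it does not align them,
# so the values of 'move' are filled in purely by position
wrong <- main
wrong[wrong$id %in% move$id, "x"] <- move$x
# id "a" receives 30 (meant for "c") and id "c" receives 10 (meant for "a")

# match() gives, for each row of 'move', the position of its id in 'main'
right <- main
pos <- match(move$id, right$id)
keep <- !is.na(pos)               # ignore ids of 'move' that are absent from 'main'
right[pos[keep], "x"] <- move$x[keep]

wrong$x  # 30  2 10
right$x  # 10  2 30
```

The data.table join in the accepted answer performs this alignment internally, which is why it does not depend on the row order of move.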