With merge, you can bring the corresponding SIP and DIP records onto the same row:
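# Example data: two records that mirror each other (SIP and DIP swapped)
df <- data.frame(
  "UID" = c(720107626538, 720108826800),
  "SIP" = c(1207697420, 3232248333),
  "DIP" = c(3232248333, 1207697420),
  "PROTOCOL" = c(17, 17),
  "SPORT" = c(53, 47904),
  "DPORT" = c(7722, 53),
  stringsAsFactors = FALSE)

# Self-join: match the SIP of one record against the DIP of another.
# The key column keeps the name "SIP"; all other columns get a suffix.
df_merged <- merge(
  df[, setdiff(colnames(df), "DIP")],
  df[, setdiff(colnames(df), "SIP")],
  by.x = "SIP",
  by.y = "DIP",
  all = FALSE,
  suffixes = c("_SIP", "_DIP"))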
After that, you can remove the duplicates using the UID fields:
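# while() re-checks nrow() on every pass, because rows are dropped inside
# the loop; it also skips the loop entirely when there are fewer than 2 rows.
i <- 2
while (i <= nrow(df_merged)) {
  ind <- df_merged$UID_DIP
  # Put row i's UID_SIP into the UID_DIP vector so a mirrored pair collides
  ind[i] <- df_merged$UID_SIP[i]
  df_merged <- df_merged[!duplicated(ind), ]
  i <- i + 1
}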
df_merged
         SIP      UID_SIP PROTOCOL_SIP SPORT_SIP DPORT_SIP      UID_DIP PROTOCOL_DIP SPORT_DIP DPORT_DIP
1 1207697420 720107626538           17        53      7722 720108826800           17     47904        53
Because the deduplication relies on a loop, the whole thing can become very slow if the dataset is large.
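If the loop turns out to be too slow, one possible alternative is a vectorized filter. The sketch below assumes that a "duplicate" means two merged rows carrying the same two UIDs in swapped order; that definition is my assumption, not something the question confirms. The names `pair` and `df_dedup` are illustrative only, and this is not guaranteed to give the same result as the loop on every input.

```r
# Same example data and self-join as above
df <- data.frame(
  UID = c(720107626538, 720108826800),
  SIP = c(1207697420, 3232248333),
  DIP = c(3232248333, 1207697420),
  PROTOCOL = c(17, 17),
  SPORT = c(53, 47904),
  DPORT = c(7722, 53),
  stringsAsFactors = FALSE)

df_merged <- merge(
  df[, setdiff(colnames(df), "DIP")],
  df[, setdiff(colnames(df), "SIP")],
  by.x = "SIP", by.y = "DIP",
  suffixes = c("_SIP", "_DIP"))

# Assumption: rows are duplicates when they hold the same unordered UID pair.
# pmin()/pmax() put each pair into a canonical (low, high) order.
pair <- data.frame(
  lo = pmin(df_merged$UID_SIP, df_merged$UID_DIP),
  hi = pmax(df_merged$UID_SIP, df_merged$UID_DIP))

# Keep only the first row of each pair; no explicit loop is needed
df_dedup <- df_merged[!duplicated(pair), ]
df_dedup
```

On the two-row example this keeps the same single row as the loop. Whether it matches the "next match ordered by UID" requirement raised in the comments would need to be checked against real data.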
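How do I get rid of the duplicate rows?
A DIP in one dataset will match a SIP in the second dataset, but only the next match, ordered by UID.
What *exactly* defines a duplicate? Is it when the other variables are identical and only the order of 'SIP' and 'DIP' differs?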