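Merge two data frames on one exact match and one partial URL match
I have been trying to do a tedious merge (on very large data) that needs one exact match and one partial match. I have tried several approaches (using pmatch, str_detect, grep and sapply) and got some results that come close, but I am looking for an elegant solution. Any help or insight would be much appreciated.
Another long-winded route I found is to do a regular merge on the common field (SessionId) and then write a for loop like the one below: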
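# my.test.daa is the result of the regular merge on SessionId
my.test.daa$Part_match <- NA_integer_
for (i in seq_len(nrow(my.test.daa))) {
  # pmatch() gives 1 when Link_URL[i] is a leading substring of Referer[i], NA otherwise
  my.test.daa$Part_match[i] <- pmatch(my.test.daa$Link_URL[i], my.test.daa$Referer[i])
  # ... get index i to also get the other columns from the dataset frame
}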
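The merge-then-loop idea above can be written without an explicit loop. Below is a minimal sketch, assuming "partial match" means that Referer begins with URL. The helper name merge_partial and the toy frames pattern_demo and dataset_demo are made up for illustration, and startsWith() needs R 3.3.0 or later.

```r
# Hypothetical helper (not from the original post): exact match on SessionId,
# then keep only the rows whose Referer begins with URL.
merge_partial <- function(pattern, dataset) {
  # Exact match: pairs every URL with every Referer in the same session.
  merged <- merge(pattern, dataset, by = "SessionId")
  # Partial match, vectorised over all rows at once.
  # as.character() drops the "AsIs" class that I() adds to a column.
  keep <- startsWith(as.character(merged$Referer), as.character(merged$URL))
  merged[keep, , drop = FALSE]
}

# Toy data, invented for this sketch only.
pattern_demo <- data.frame(
  SessionId = c("s1", "s2", "s1"),
  URL       = c("site.com/a/1/", "site.com/a/2/", "site.com/a/3/"),
  stringsAsFactors = FALSE
)
dataset_demo <- data.frame(
  SessionId = c("s1", "s2", "s1"),
  Referer   = c("site.com/a/1/9/", "site.com/a/7/", "site.com/a/3/4/"),
  stringsAsFactors = FALSE
)

result_demo <- merge_partial(pattern_demo, dataset_demo)
print(result_demo)
```

Note that merge() builds every URL/Referer pair within a session before filtering, so sessions with many rows on both sides could make the intermediate table large on very big data.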
New data - with duplicates
pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef2",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef4",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef1")),
URL = I(c("somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/",
"somewebsite.com/abc/detail/110302288513/",
"somewebsite.com/abc/detail/110302288514/",
"somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/"
)))
dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef7",
"5b8cc8794a02ba868db21faef1"
)),
Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
"somewebsite.com/abc/detail/110302288513/1103022815/",
"somewebsite.com/abc/detail/110302288513/11030228/",
"somewebsite.com/abc/detail/110302288465464/",
"somewebsite.com/abc/detail/110302288512/46545465/"
)))
OLD - here is sample code for the data.frames:
pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef2",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef4",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef6")),
URL = I(c("somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/",
"somewebsite.com/abc/detail/110302288513/",
"somewebsite.com/abc/detail/110302288514/",
"somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/"
)))
dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef7",
"5b8cc8794a02ba868db21faef2"
)),
Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
"somewebsite.com/abc/detail/110302288513/1103022815/",
"somewebsite.com/abc/detail/110302288513/11030228/",
"somewebsite.com/abc/detail/110302288465464/",
"somewebsite.com/abc/detail/1103022846546/"
)))
New output - with duplicates
SessionId URL Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288512/ somewebsite.com/abc/detail/110302288512/46545465/
So the OLD output needs to look like this:
SessionId URL Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/
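As a rough check of the expected "new" output, here is a sketch that rebuilds the new sample data and filters the merged rows with a literal substring test. It assumes "partial match" means that URL occurs somewhere inside Referer. The helpers base and sid are shorthand invented here, and the columns are plain character vectors rather than I() columns.

```r
# Rebuild the "new" sample data from the post in compact form.
base <- "somewebsite.com/abc/detail/"                        # shorthand, not in the post
sid  <- function(n) paste0("5b8cc8794a02ba868db21faef", n)   # shorthand, not in the post

pattern <- data.frame(
  SessionId = sid(c(1, 2, 3, 4, 5, 1)),
  URL = paste0(base, c("110302288511/", "110302288512/", "110302288513/",
                       "110302288514/", "110302288511/", "110302288512/")),
  stringsAsFactors = FALSE
)
dataset <- data.frame(
  SessionId = sid(c(1, 3, 5, 7, 1)),
  Referer = paste0(base, c("110302288511/110302288512/", "110302288513/1103022815/",
                           "110302288513/11030228/", "110302288465464/",
                           "110302288512/46545465/")),
  stringsAsFactors = FALSE
)

# Exact match on SessionId: every URL/Referer pair within a session.
candidates <- merge(pattern, dataset, by = "SessionId")

# Partial match: URL must occur literally inside Referer.
# fixed = TRUE keeps "." and other regex metacharacters from being interpreted.
contains <- mapply(grepl, candidates$URL, candidates$Referer,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

result <- candidates[contains, ]
rownames(result) <- NULL
print(result)
```

On this sample the filter keeps three rows, the same SessionId/URL/Referer combinations as the new output listed above. The row order may differ because merge() sorts by the key.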
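You should replace 'dd$Referer' with 'Referer' in the 'grep' call, which is why you are getting different results. Unfortunately, I don't see how this improves on the OP's approach: you are just replacing the for loop with an apply loop. – eddi
@eddi I have completely revised it, so it may answer the question now. – agstudy
I would love to see a benchmark of this vs Rcpp (I can't do it right now). – eddi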