2013-08-01

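Merge two data frames on one exact match and one partial URL match

I have been trying to do a tedious merge (on very large data) on one exact match and one partial match. I have tried several approaches (using pmatch, str_detect, grep, and sapply) and got some close results, but I am still looking for an elegant solution. Any help or insight would be greatly appreciated.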

Another long route I found is to do a regular merge on the common field (SessionId) and then write a for loop like the one below:

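# my.test.daa is the result of the regular merge on SessionId
for (i in seq_len(nrow(my.test.daa))) {
  # pmatch() returns 1 when Link_URL[i] is a prefix of Referer[i], otherwise NA
  my.test.daa$Part_match[i] <- pmatch(my.test.daa$Link_URL[i], my.test.daa$Referer[i])
  # ... use index i to also pull the other columns from the dataset frame
}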
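
As a quick check of what the loop relies on, the sketch below calls pmatch on single strings. The URLs are taken from the sample data and the variable names are only illustrative. With a one-element table, pmatch returns 1 when its first argument is a prefix of the second and NA otherwise.

```r
# Sketch: element-wise behaviour of pmatch() with a single-element table
url     <- "somewebsite.com/abc/detail/110302288511/"
referer <- "somewebsite.com/abc/detail/110302288511/110302288512/"
other   <- "somewebsite.com/abc/detail/110302288513/1103022815/"

hit  <- pmatch(url, referer)  # url is a prefix of referer
miss <- pmatch(url, other)    # url is not a prefix of other

print(hit)
print(miss)
```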

New data - with duplicates:

pattern <- data.frame(
  SessionId = I(c("5b8cc8794a02ba868db21faef1",
                  "5b8cc8794a02ba868db21faef2",
                  "5b8cc8794a02ba868db21faef3",
                  "5b8cc8794a02ba868db21faef4",
                  "5b8cc8794a02ba868db21faef5",
                  "5b8cc8794a02ba868db21faef1")),
  URL = I(c("somewebsite.com/abc/detail/110302288511/",
            "somewebsite.com/abc/detail/110302288512/",
            "somewebsite.com/abc/detail/110302288513/",
            "somewebsite.com/abc/detail/110302288514/",
            "somewebsite.com/abc/detail/110302288511/",
            "somewebsite.com/abc/detail/110302288512/")))


dataset <- data.frame(
  SessionId = I(c("5b8cc8794a02ba868db21faef1",
                  "5b8cc8794a02ba868db21faef3",
                  "5b8cc8794a02ba868db21faef5",
                  "5b8cc8794a02ba868db21faef7",
                  "5b8cc8794a02ba868db21faef1")),
  Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
                "somewebsite.com/abc/detail/110302288513/1103022815/",
                "somewebsite.com/abc/detail/110302288513/11030228/",
                "somewebsite.com/abc/detail/110302288465464/",
                "somewebsite.com/abc/detail/110302288512/46545465/")))

OLD - here is sample code for the data.frames:

pattern <- data.frame(
  SessionId = I(c("5b8cc8794a02ba868db21faef1",
                  "5b8cc8794a02ba868db21faef2",
                  "5b8cc8794a02ba868db21faef3",
                  "5b8cc8794a02ba868db21faef4",
                  "5b8cc8794a02ba868db21faef5",
                  "5b8cc8794a02ba868db21faef6")),
  URL = I(c("somewebsite.com/abc/detail/110302288511/",
            "somewebsite.com/abc/detail/110302288512/",
            "somewebsite.com/abc/detail/110302288513/",
            "somewebsite.com/abc/detail/110302288514/",
            "somewebsite.com/abc/detail/110302288511/",
            "somewebsite.com/abc/detail/110302288512/")))


dataset <- data.frame(
  SessionId = I(c("5b8cc8794a02ba868db21faef1",
                  "5b8cc8794a02ba868db21faef3",
                  "5b8cc8794a02ba868db21faef5",
                  "5b8cc8794a02ba868db21faef7",
                  "5b8cc8794a02ba868db21faef2")),
  Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
                "somewebsite.com/abc/detail/110302288513/1103022815/",
                "somewebsite.com/abc/detail/110302288513/11030228/",
                "somewebsite.com/abc/detail/110302288465464/",
                "somewebsite.com/abc/detail/1103022846546/")))

New output - with duplicates:

SessionId                  URL                                      Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/ 
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/ 
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288512/ somewebsite.com/abc/detail/110302288512/46545465/ 

So the OLD output needs to look like this:

SessionId                  URL                                      Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/ 
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/ 

Answers

1

You can put your data in long format and then process it by ID within a data.table.

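library(reshape2)
# Stack both data frames in long format: one row per (SessionId, variable, value)
dat <- do.call(rbind, lapply(list(pattern, dataset), function(x)
  melt(x, id.vars = 'SessionId')))
library(data.table)
DT <- data.table(dat, key = 'SessionId')

# Keep sessions with exactly one URL and one Referer where the URL (value[1],
# treated as a regular expression by grep) is found in the Referer (value[2])
DT[, if (.N == 2)
       if (length(grep(value[1], value[2])) > 0) as.list(value),
   by = 'SessionId']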

        SessionId          V1             V2 
1: 5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/ 
2: 5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/ 
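
One detail worth keeping in mind with the grep call above: its first argument is interpreted as a regular expression, so each dot in a URL matches any character. The sketch below uses a made-up referer string (an assumption for illustration, not part of the question's data) to show the difference that fixed = TRUE makes.

```r
url <- "somewebsite.com/abc/detail/110302288511/"
# Hypothetical referer in which the dot has been replaced by another character
lookalike <- "somewebsiteXcom/abc/detail/110302288511/110302288512/"

regex_hit <- grepl(url, lookalike)                # "." in the pattern matches "X"
fixed_hit <- grepl(url, lookalike, fixed = TRUE)  # literal, character-for-character

print(c(regex = regex_hit, fixed = fixed_hit))
```

For URL data the literal comparison is usually the intended one, although with the sample data in the question both variants give the same rows.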

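EDIT: benchmark of the 2 solutions on the OP's data (I was too lazy to create a large sample data set). eddi's solution is about 3 times faster. This result is expected: my solution is slower because it uses an extra step to reshape the data with reshape2, which is somewhat slow.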

microbenchmark(eddi(),agstudy(),times=100) 
Unit: milliseconds 
     expr  min  lq median  uq  max neval 
    eddi() 3.232808 3.427557 3.553092 3.768891 8.665698 100 
agstudy() 9.998795 10.615281 11.208633 12.438759 129.517833 100 

Here is the code used for the benchmark:

library(inline) 
library(Rcpp) 
library(reshape2) 

eddi <- function(){ 
    library(data.table) 
    pattern = data.table(pattern, key = 'SessionId') 
    dataset = data.table(dataset, key = 'SessionId') 
    dataset[pattern, nomatch = 0][string_compare(URL, Referer) == 1] 
} 

agstudy <- function(){
    dat <- do.call(rbind, lapply(list(pattern, dataset), function(x)
        melt(x, id.vars = 'SessionId')))
    library(data.table)
    DT <- data.table(dat, key = 'SessionId')

    DT[, if (.N == 2)
             if (length(grep(value[1], value[2])) > 0) as.list(value),
       by = 'SessionId']
}

library('microbenchmark') 
microbenchmark(eddi(),agstudy(),times=100) 

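EDIT2: to manage the case with duplicates, it is better to use the wide format. Inspired by @eddi's function, here is my version, which does not require creating an Rcpp function.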

pattern = data.table(pattern, key = 'SessionId')
dataset = data.table(dataset, key = 'SessionId')
dataset[pattern, nomatch = 0][mapply(grep, URL, Referer) == 1]
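
For readers without data.table, the same wide-format idea can be sketched in base R: join on SessionId, then keep the rows where the URL is a prefix of the Referer. This is a swapped-in variant rather than the code benchmarked here. It uses merge and startsWith (available in base R from 3.3.0), and it tests for a prefix, as string_compare does, rather than for a match anywhere in the string as grep does. The helper sid and the variable base are illustrative names only; the data are the "with duplicates" sample from the question, stored as plain character columns.

```r
# Helper and prefix used only to keep the sample data compact (illustrative names)
sid  <- function(n) paste0("5b8cc8794a02ba868db21faef", n)
base <- "somewebsite.com/abc/detail/"

pattern <- data.frame(
  SessionId = sid(c(1, 2, 3, 4, 5, 1)),
  URL = paste0(base, c("110302288511/", "110302288512/", "110302288513/",
                       "110302288514/", "110302288511/", "110302288512/")),
  stringsAsFactors = FALSE
)
dataset <- data.frame(
  SessionId = sid(c(1, 3, 5, 7, 1)),
  Referer = paste0(base, c("110302288511/110302288512/", "110302288513/1103022815/",
                           "110302288513/11030228/", "110302288465464/",
                           "110302288512/46545465/")),
  stringsAsFactors = FALSE
)

# Exact match on SessionId: every URL/Referer combination within a session
joined <- merge(pattern, dataset, by = "SessionId")

# Partial match: keep rows where the Referer starts with the URL
result <- joined[startsWith(joined$Referer, joined$URL), ]
print(result)
```

Because merge produces every URL/Referer pair within a session, duplicated SessionId values are handled without any special casing, at the cost of a larger intermediate table.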

PS: I benchmarked this against eddi's function, and the latter is still slightly faster.

microbenchmark(eddi(),agstudy(),times=100) 
Unit: milliseconds 
     expr  min  lq median  uq  max neval 
    eddi() 3.684126 3.819901 4.007634 4.395048 8.490101 100 
agstudy() 4.057697 4.250171 4.595298 4.835747 8.581503 100 
+0

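You should replace 'dd$Referer' with 'Referer' in the 'grep', which is why you are getting different results; unfortunately, I don't see how this improves on what the OP has, since you are just replacing a for loop with an apply loop – eddi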

+0

@eddi I have completely revised my answer now. – agstudy

+0

I would love to see a benchmark of this vs Rcpp (I can't do it right now) – eddi

1

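I don't think the string-vector comparison function needed here exists in R, but you can just write your own. Note that there are various checks that should be in the code below and that I have left out (such as checking that the two vectors have the same length), especially if one wants to use the string_compare function outside of this problem: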

library(inline) 
library(Rcpp) 

string_compare = cxxfunction(signature(x = 'character', y = 'character'), '
    CharacterVector a(x), b(y);
    NumericVector res(a.size(), 1.0);  // 1 = a[i] is a prefix of b[i]

    for (int i = 0, size = a.size(); i < size; ++i) {
        int alen = a[i].size();
        int blen = b[i].size();
        // a longer string cannot be a prefix of a shorter one
        if (alen > blen) {
            res[i] = 0;
            continue;
        }
        // compare character by character, stopping at the first mismatch
        for (int j = 0; j < alen; ++j) {
            if (a[i][j] != b[i][j]) {
                res[i] = 0;
                break;
            }
        }
    }

    return res;
', plugin = 'Rcpp')

library(data.table) 
pattern = data.table(pattern, key = 'SessionId') 
dataset = data.table(dataset, key = 'SessionId') 

dataset[pattern, nomatch = 0][string_compare(URL, Referer) == 1] 
#     SessionId            Referer          URL 
#1: 5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/110302288512/ somewebsite.com/abc/detail/110302288511/ 
#2: 5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/1103022815/ somewebsite.com/abc/detail/110302288513/ 
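
The input checks mentioned at the top of this answer could live in a thin R wrapper. The sketch below is one possible shape, not part of the original answer: the name checked_prefix_match is hypothetical, and base R's startsWith (R 3.3.0 or later) stands in for the compiled string_compare so that the example runs without a compiler.

```r
# Hypothetical wrapper adding the checks left out above.
# startsWith() stands in for the compiled comparison function.
checked_prefix_match <- function(prefix, full) {
  stopifnot(is.character(prefix), is.character(full))
  if (length(prefix) != length(full)) {
    stop("'prefix' and 'full' must have the same length")
  }
  startsWith(full, prefix)
}

res <- checked_prefix_match(
  c("somewebsite.com/abc/detail/110302288511/",
    "somewebsite.com/abc/detail/110302288511/"),
  c("somewebsite.com/abc/detail/110302288511/110302288512/",
    "somewebsite.com/abc/detail/110302288513/11030228/")
)
print(res)

# Vectors of different lengths are rejected instead of being silently recycled
mismatch <- tryCatch(checked_prefix_match("a", c("a", "b")),
                     error = function(e) "error")
print(mismatch)
```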
+0

I benchmarked the 2 solutions. – agstudy

+0

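@eddi - Thanks for the quick reply. I actually tried to run your solution, but for some reason, even when just trying to create the function 'string_compare', I get the following error: 'Error in signature(x = "character", y = "character") : unused argument (y = "character")' –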

+0

@DevPatel And you didn't get any errors when loading 'inline' or 'Rcpp'? If you copy-pasted the code above, I don't know why you would get that error, sorry. – eddi
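
One possible cause, offered only as a guess since the thread does not resolve it: cxxfunction is normally given the result of signature from the methods package, and an "unused argument" error would be consistent with a different function named signature masking it in the session. The sketch below shows how one might check which packages provide signature and call the methods version explicitly.

```r
# Which attached packages or environments define `signature`?
# (a guess at the cause, not a confirmed diagnosis)
print(utils::find("signature"))

# Calling the methods version explicitly sidesteps any masking
sig <- methods::signature(x = "character", y = "character")
print(names(sig))
```

If something other than package:methods is listed first, passing methods::signature(...) to cxxfunction would be one way to test this guess.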
