2012-05-16 66 views
6

I have two data sets that I want to merge in R with partial matching:

datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh", 
"Kish", "Lissp", "Kcn", "CCCa"), 
number1 = c(1, 8, 9, 2, 18, 25, 33, 8)) 
#----------- 
     name number1
1 regular       1
2  kklmin       8
3   notSo       9
4   Jijoh       2
5    Kish      18
6   Lissp      25
7     Kcn      33
8    CCCa       8

datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp", 
"KcN", "CaPN"), 
    number2 = c(2, 8, 12, 13, 20, 18, 13)) 
#------------- 
    name number2
1 reGulr       2
2   ntSo       8
3  Jijoh      12
4   sean      13
5  LiSsp      20
6    KcN      18
7   CaPN      13

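I want to merge them on the name column, but with partial matches allowed, so that spelling errors in a large data set do not block the merge (and, ideally, so that such spelling errors can be detected). For example:

(1) If four consecutive letters (or all letters, if the name has fewer than 4) match at any position, the match is fine: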

ABBCD = BBCDK = aBBCD = ramABBBCD = ABB 

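(2) Matching is case-insensitive, e.g. ABBCD = aBbCd

(3) The new data set keeps both names (the name from datf1 and the name from datf2), so that afterwards we can check whether a match is perfect (perhaps with a separate column giving the number of letters matched).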
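Rules (1) and (2) can be sketched as a small predicate. This is only an illustration, assuming that "four consecutive letters" means a shared substring of length min(4, length of the shorter name); the helper name `shares_run` and its `k` argument are made up here and do not come from any package:

```r
# Hypothetical helper: TRUE when a and b share a run of `k` consecutive
# letters (or all letters of the shorter name, if it has fewer than k).
shares_run <- function(a, b, k = 4) {
  a <- tolower(a)   # rule (2): ignore case
  b <- tolower(b)
  need <- min(k, nchar(a), nchar(b))
  if (need == 0) return(FALSE)
  # every substring of `a` that is exactly `need` characters long
  starts <- seq_len(nchar(a) - need + 1)
  pieces <- substring(a, starts, starts + need - 1)
  # rule (1): one of those pieces occurs anywhere in `b`
  any(vapply(pieces, function(p) grepl(p, b, fixed = TRUE), logical(1)))
}

shares_run("ramABBBCD", "ABBCD")  # TRUE, shares "bbcd"
shares_run("ABB", "ABBCD")        # TRUE, name shorter than 4 letters
shares_run("notSo", "ntSo")       # FALSE, longest shared run is "tso"
```

Note that under this reading very short names match easily, so the threshold `k` may need tuning on real data.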

Is such a merge possible?

Edit:

datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh", 
      "Kish", "Lissp", "Kcn", "CCCa"), 
        number1 = c(1, 8, 9, 2, 18, 25, 33, 8)) 
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", 
      "LiSsp", "KcN", "CaPN"), 
        number2 = c(2, 8, 12, 13, 20, 18, 13)) 


uglyMerge(datf1, datf2) 
       name1  name2 number1 number2 matches
1  xxregular   <NA>       1      NA       0
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA> reGulr      NA       2       0
10      <NA>   ntSo      NA       8       0
11      <NA>   sean      NA      13       0
12      <NA>   CaPN      NA      13       0

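I tried to fix some of the formatting. I see you added a copy of "uglyMerge", which appears to come from @sgibb's answer. The lack of a match between 'xxregular' and 'reGulr' may be obvious to you, but you may need to explain it to us, because it seems to fit your specification. –

Answers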

7

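Maybe there is an easy solution, but I couldn't find one.
IMHO you have to implement this kind of merge yourself.
Please find an ugly example below (there is a lot of room for improvement):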

uglyMerge <- function(df1, df2) { 

    ## lower all strings to allow case-insensitive comparison 
    lowerNames1 <- tolower(df1[, 1]); 
    lowerNames2 <- tolower(df2[, 1]); 

    ## split strings into single characters 
    names1 <- strsplit(lowerNames1, ""); 
    names2 <- strsplit(lowerNames2, ""); 

    ## create the final dataframe 
    mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA, 
         number1=df1[,2], number2=NA, matches=0, 
         stringsAsFactors=FALSE); 

    ## store names of dataframe2 (to remember which strings have no match) 
    toMerge <- df2[, 1]; 

    for (i in seq(along=names1)) { 
     for (j in seq(along=names2)) { 
      ## set minimal match to 4 or to string length 
      minMatch <- min(4, length(names2[[j]])); 

      ## find single matches 
      matches <- names1[[i]] %in% names2[[j]]; 

      ## look for consecutive matches 
      r <- rle(matches); 

      ## any matches found? 
      if (any(r$values)) { 
       ## find the longest run of consecutive matches 
       ## (which.max indexes only the TRUE runs, so map it back) 
       possibleMatch <- r$values == TRUE; 
       maxPos <- which(possibleMatch)[which.max(r$lengths[possibleMatch])]; 

       ## store max consecutive match length 
       maxMatch <- r$lengths[maxPos]; 

       ## to remove FALSE-POSITIVES (e.g. CCC and kcn) find 
       ## largest substring 
       start <- sum(r$lengths[seq_len(maxPos - 1)]) + 1; 
       stop <- start + r$lengths[maxPos] - 1; 
       maxSubStr <- substr(lowerNames1[i], start, stop); 

       ## all matching criteria fulfilled 
       isConsecutiveMatch <- maxMatch >= minMatch && 
            grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) && 
            nchar(maxSubStr) > 0; 

       if (isConsecutiveMatch) { 
        ## merging 
        mergedDf[i, "matches"] <- maxMatch 
        mergedDf[i, "name2"] <- as.character(df2[j, 1]); 
        mergedDf[i, "number2"] <- df2[j, 2]; 

        ## don't append this row to mergedDf because already merged 
        toMerge[j] <- NA; 

        ## stop inner for loop here to avoid possible second match 
        break; 
       } 
      } 
     } 
    } 

    ## append not matched rows to mergedDf 
    toMerge <- which(df2[, 1] == toMerge); 
    df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]), 
        number1=NA, number2=df2[toMerge, 2], matches=0, 
        stringsAsFactors=FALSE); 
    mergedDf <- rbind(mergedDf, df2); 

    return (mergedDf); 
} 

Output:

> uglyMerge(datf1, datf2) 
       name1  name2 number1 number2 matches
1  xxregular reGulr       1       2       5
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA>   ntSo      NA       8       0
10      <NA>   sean      NA      13       0
11      <NA>   CaPN      NA      13       0
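The `matches` column is meant to count the consecutive letters two names share. As a cross-check, that count can also be computed directly as the length of the longest common substring. The sketch below uses a standard dynamic-programming approach; `lcs_length` is a hypothetical helper written for illustration and is not part of `uglyMerge`:

```r
# Hypothetical helper: length of the longest run of consecutive characters
# shared by a and b, ignoring case.
lcs_length <- function(a, b) {
  x <- strsplit(tolower(a), "")[[1]]
  y <- strsplit(tolower(b), "")[[1]]
  if (length(x) == 0 || length(y) == 0) return(0L)
  best <- 0L
  # prev[j + 1]: length of the common run ending at x[i - 1] and y[j]
  prev <- integer(length(y) + 1)
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        cur[j + 1] <- prev[j] + 1L   # extend the run ending one step back
        if (cur[j + 1] > best) best <- cur[j + 1]
      }
    }
    prev <- cur
  }
  best
}

lcs_length("xxregular", "reGulr")  # 5, the run "regul"
lcs_length("notSo", "ntSo")        # 3, the run "tso"
```

A pair would then count as matched when this length reaches min(4, number of letters in the shorter name), which is one possible reading of the rule in the question.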

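Thanks for the excellent solution. It works with the example I provided. But I may need to tweak it further, because it does not seem to work when the match is not in the first four consecutive letters; see my edit, where I added xxx before regular and it no longer matches. Still, it gives me a good start, thanks! – jon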


@hijo Sorry, there were some errors in my substring calculation. Please use my revised version. – sgibb


Using some [edit distance](http://en.wikipedia.org/wiki/Levenshtein_distance) for the string matching might also be valuable. There is an [implementation](http://stackoverflow.com/questions/3182091/fast-levenshtein-distance-in-r) in R. –
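As a rough illustration of the edit-distance idea, base R ships `adist()` in the `utils` package, which returns a matrix of generalized Levenshtein distances. This sketch uses it instead of the implementation linked above; the variable names are made up for the example:

```r
names1 <- c("xxregular", "kklmin", "notSo", "Jijoh",
            "Kish", "Lissp", "Kcn", "CCCa")
names2 <- c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp", "KcN", "CaPN")

# rows follow names1, columns follow names2
d <- utils::adist(names1, names2, ignore.case = TRUE)

# for every name in names1, the closest name in names2 and its distance
best <- apply(d, 1, which.min)
closest <- data.frame(name1 = names1,
                      name2 = names2[best],
                      distance = d[cbind(seq_along(names1), best)],
                      stringsAsFactors = FALSE)
closest
```

Every row gets a closest candidate here, so a distance cutoff would still have to be chosen before treating a pair as a match. Note also that edit distance is a different criterion from "four consecutive letters": "notSo" and "ntSo" are one edit apart even though they share only three consecutive letters.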

3

agrep will get you started.

Something like:

lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name))) 

Then you can adjust the max.distance parameter until you get the right amount of matching. Then merge however you like.
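One possible way to turn the `agrep` call into a merge is sketched below. It assumes that taking the first approximate hit is good enough and that `max.distance = 0.2` is a sensible starting value; both are guesses to tune, and `first_fuzzy_match` is a hypothetical helper name:

```r
datf1 <- data.frame(name = c("xxregular", "kklmin", "notSo", "Jijoh",
                             "Kish", "Lissp", "Kcn", "CCCa"),
                    number1 = c(1, 8, 9, 2, 18, 25, 33, 8),
                    stringsAsFactors = FALSE)
datf2 <- data.frame(name = c("reGulr", "ntSo", "Jijoh", "sean",
                             "LiSsp", "KcN", "CaPN"),
                    number2 = c(2, 8, 12, 13, 20, 18, 13),
                    stringsAsFactors = FALSE)

# Hypothetical helper: index of the first approximate match of x among
# candidates, or NA when agrep finds none.
first_fuzzy_match <- function(x, candidates, max.distance = 0.2) {
  hit <- agrep(x, candidates, max.distance = max.distance,
               ignore.case = TRUE)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

idx <- vapply(datf1$name, first_fuzzy_match, integer(1),
              candidates = datf2$name, USE.NAMES = FALSE)

# keep both names side by side; unmatched rows get NA from indexing with NA
merged <- data.frame(name1 = datf1$name, name2 = datf2$name[idx],
                     number1 = datf1$number1, number2 = datf2$number2[idx],
                     stringsAsFactors = FALSE)
merged
```

This only keeps the rows of datf1 (a left join); rows of datf2 that were never matched would have to be appended separately, as `uglyMerge` does.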