2017-03-12
1

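R fuzzy string match to return specific column based on matched string

I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses. I want to check whether any address in the smaller dataset is present in the larger one. As you can imagine, addresses can be written in different ways, with different case, spelling and so on. On top of that, an address can be duplicated if it is written only down to the building level, so different flats share the same address. I did some research and found the package stringdist, which can be used for this.

I did some work and managed to get the closest match based on distance. However, I am not able to return the corresponding columns for the row whose address matched.

Below is some sample dummy data, along with the code I have written, to illustrate the situation.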

library(stringdist) 
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr") 
Year1 <- c(2001:2007) 

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR") 
Year2 <- c(2001:2010) 

df1 <- data.table(Address1,Year1) 
df2 <- data.table(Address2,Year2) 
df2[,unique_id := sprintf("%06d", 1:nrow(df2))] 

fn_match = function(str, strVec, n){ 
    strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)] 
} 

df1[!is.na(Address1) 
    , address_match := 
     fn_match(Address1, df2$Address2,3) 
    ] 

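This returns the closest string match within a distance of 3. However, I also want the columns "Year" and "unique_id" from df2 to appear in df1. That would tell me which row of df2 each string was matched to. So in the end, for each row in df1, I want to know the closest match from df2 based on the specified distance, together with the "Year" and "unique_id" of the matched row in df2.

I guess this has something to do with a merge (left join), but I am not sure how to merge while keeping the duplicates and making sure I end up with the same number of rows as df1 (the smaller dataset).

Any kind of solution would help!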

+0

Not at my computer right now, but see `which.min` with `stringdist()` from your previous question. Consider how you want to handle ties. – C8H10N4O2

+0

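@C8H10N4O2, thank you for the suggestion. Yes, which.min helps to find the minimum, but in this case I want several corresponding columns from the row of the matched string. Since there are duplicate addresses in the large dataset, I want unique_id to identify the matched row; then I can merge any other required columns from the large dataset by unique_id. – user1412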

+0

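@C8H10N4O2, I really hope you can suggest a solution. Even returning the row number of the matched string in the large dataset would help, since I could then merge the required columns based on that row number. – user1412

Answers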

1

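You are 90% of the way there...

You say you want to

"know which row of df2 the string was matched to"

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index of the row in df2 (which is your table) that is the closest match to each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.

Instead, retrieve either the index itself for lookup, or the unique_id (if you are truly confident that it is a unique ID) for a left join.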
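A minimal sketch of that idea, on hypothetical toy data (the `lookup` table and `queries` vector are made up for illustration and are not the asker's data). It assumes only that stringdist is installed. The positions returned by amatch are used directly to pull other columns, and `anyDuplicated` checks that the ID really is unique before it is relied on as a join key:

```r
library(stringdist)

# Hypothetical lookup table standing in for the large dataset
lookup <- data.frame(
  name = c("uhura", "leela", "spock"),
  id   = c("000001", "000002", "000003"),
  stringsAsFactors = FALSE
)
queries <- c("leia", "spok", "zzzzzzzz")

# amatch() returns the position in lookup$name of the closest match,
# or NA when nothing lies within maxDist
pos <- amatch(queries, lookup$name, method = "dl", maxDist = 2)

# The positions index any other column of the lookup table
matched_id <- lookup$id[pos]

# The ID is only safe as a join key if it has no duplicates
id_is_unique <- anyDuplicated(lookup$id) == 0
```

Unmatched queries come back as NA in both `pos` and `matched_id`, so the result always has one entry per query.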

An illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2 = c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2 = 2001:2010)
df2[, unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos <- function(str, strVec, n) {
    amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) & nchar(Address1) > 0, # I would exclude not only NA_character_ but also the empty string, perhaps strings of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)]]
merge(df1, df2, by = "unique_id", all.x = TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3)]
df1[!is.na(df2_pos), c("Address2", "Year2", "UniqueID") := df2[df2_pos, .(Address2, Year2, unique_id)]][]
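
The `which.min` route mentioned in the comments yields the same kind of row position. Here is a rough sketch on hypothetical toy vectors (`x`, `tbl` and the cutoff of 2 are assumptions for illustration), using `stringdistmatrix` from stringdist:

```r
library(stringdist)

# Hypothetical toy vectors, not the asker's data
x   <- c("leia", "spok")
tbl <- c("uhura", "leela", "spock")

# Full distance matrix: one row per element of x, one column per element of tbl
d <- stringdistmatrix(x, tbl, method = "dl")

# Position of the smallest distance in each row; which.min keeps the first on ties
pos  <- apply(d, 1, which.min)
dmin <- apply(d, 1, min)

# Mimic maxDist: discard matches that are further away than the cutoff
pos[dmin > 2] <- NA
```

This builds the whole matrix in memory. At 70K by 500K addresses that is 3.5e10 cells, roughly 280 GB as doubles, so at the sizes in the question it would need chunking, whereas amatch avoids materialising the matrix.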
+0

Thank you very much for the solution and the explanation. This really helps! Thanks again. – user1412

+0

@user1412 you're welcome. If you need to check uniqueness, see `?duplicated`, e.g. `!anyDuplicated(...)` – C8H10N4O2

+0

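Thanks for the support! I was also exploring stringdistmatrix to build a matrix and then take the minimum distance. I have done that and the code works; I wrapped it in a function. But now I need to do the matching separately for different areas, so I want another function on top of the existing one. I managed to write one function, but a function over a function is proving difficult... still a lot to learn. I have posted this as a question: http://stackoverflow.com/questions/42793833/r-function-for-a-function-to-be-repeated-based-on-column-values Please help! – user1412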

0

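Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and offers stringdist as one of the possible types of fuzzy matching.

You can use the stringdist method = "dl" (or another one that might work better).

To meet your requirement of "making sure I have the same number of rows as df1", I used a large max_dist and then used dplyr::group_by and dplyr::top_n to keep only the best match, the one with the minimum distance. fuzzyjoin is developed by dgrtwo. (Hopefully this will become part of the package itself in the future.)

(I also had to make an assumption: in case of ties in distance, take the largest Year2.)

Code: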

library(data.table, quietly = TRUE) 
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"), 
        Year1 = 2001:2007) 
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"), 
        Year2=2001:2010) 
df2[,unique_id := sprintf("%06d", .I)] 

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE) 
stringdist_join(df1, df2, 
       by = c("Address1" = "Address2"), 
       mode = "left", 
       method = "dl", 
       max_dist = 99, 
       distance_col = "dist") %>% 
    group_by(Address1, Year1) %>% 
    top_n(1, -dist) %>% 
    top_n(1, Year2) 

Result:

# A tibble: 7 x 6
# Groups: Address1, Year1 [7]
  Address1                               Year1 Address2                             Year2 unique_id  dist
  <chr>                                  <int> <chr>                                <int> <chr>     <dbl>
1 786, GALI NO 5, XYZ                     2001 786, GALI NO 4 XYZ                    2007 000007        2
2 rambo, 45, strret 4, atlast, pqr        2002 del, 546, strret2, towards east, pqr  2009 000009       17
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR  2003 23/4, STREET 2, PQR                   2010 000010       19
4 45-B, GALI NO5, XYZ                     2004 45B, GALI NO 5, XYZ                   2008 000008        2
5 HECTIC, 99 STREET, PQR                  2005 23/4, STREET 2, PQR                   2010 000010       11
6 786, GALI NO 5, XYZ                     2006 786, GALI NO 4 XYZ                    2007 000007        2
7 rambo, 45, strret 4, atlast, pqr        2007 del, 546, strret2, towards east, pqr  2009 000009       17
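
The tie-breaking step (smallest dist first, then largest Year2) can also be looked at in isolation. The sketch below uses a hypothetical `candidates` table standing in for the output of a fuzzy join with a large max_dist, and assumes dplyr 1.0.0 or later, where `slice_min` and `slice_max` are available as replacements for `top_n`:

```r
library(dplyr)

# Hypothetical candidate pairs, as a fuzzy join with a large max_dist might return them
candidates <- data.frame(
  Address1 = c("a", "a", "a", "b", "b"),
  Year2    = c(2002L, 2007L, 2009L, 2003L, 2008L),
  dist     = c(2, 2, 17, 5, 1),
  stringsAsFactors = FALSE
)

best <- candidates %>%
  group_by(Address1) %>%
  slice_min(dist, n = 1, with_ties = TRUE) %>%    # keep every row tied at the minimum distance
  slice_max(Year2, n = 1, with_ties = FALSE) %>%  # among those, keep the single largest Year2
  ungroup()
```

With `with_ties = FALSE` in the last step, each group is reduced to exactly one row, which is what keeps the row count equal to the number of distinct keys on the left side.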