2017-05-02 40 views

Matching data tables in R using string matching

I have two data tables:

library(data.table)
dt1 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111"), stringsAsFactors = FALSE) 

dt2 <- data.table(Code=c("AAA111", "AAA222", "AAA333", "AAA444", "AAA555", "AAA666", "BBB111", "BBB222", "BBB333", "BBB444", "BBB555", "BBB666", "CCC111", "CCC222", "CCC333", "CCC444", "CCC555", "CCC666", "DDD111", "DDD222", "DDD333", "DDD444", "DDD555", "DDD666", "EEE111", "EEE222", "EEE333", "EEE444", "EEE555", "EEE666"), stringsAsFactors = FALSE) 
dt2$Ref <- 1:nrow(dt2) 

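Each row of dt1 contains an unformatted string that includes a "code". dt2 contains the list of codes that can be matched. What I'm after is a way to pick out the "code" part of the string in each row of dt1 and match it to the corresponding code in dt2. If there is no matching code in dt2, NA should be returned.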

Here is the type of output I'm after:

# Unmatched rows carry a real NA (not the string "NA"), and Ref is numeric as in dt2
dt3 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111"), Code=c("AAA111", "BBB222", "CCC333", "DDD444", "CCC333", "AAA111", "EEE222", NA), Ref=c(1L, 8L, 15L, 22L, 15L, 1L, 26L, NA), stringsAsFactors = FALSE) 

I've tried using regular expressions, with grep and so on, to find a solution, but haven't gotten anywhere.

Answers


You can use regex_left_join from my fuzzyjoin package:

library(fuzzyjoin) 
# Each dt2$Code is used as a regular expression and searched for in dt1$V1;
# rows of dt1 with no match are kept, with NA in the dt2 columns
regex_left_join(dt1, dt2, by = c(V1 = "Code")) 
#>                            V1   Code Ref
#> 1:  Apple Pear Orange, AAA111 AAA111   1
#> 2: Grapes Banana Pear .BBB222 BBB222   8
#> 3: Orange Kiwi Melon ,CCC333. CCC333  15
#> 4:  Apple DDD444, Pear Orange DDD444  22
#> 5:  Kiwi Melon Orange, CCC333 CCC333  15
#> 6:  Apple Pear Orange, AAA111 AAA111   1
#> 7:     Tomato Cucumber-EEE222 EEE222  26
#> 8:      Seagull Pigeon ZZZ111     NA  NA
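
If you'd rather not add a dependency, here is a rough sketch of a different route: extract the code with a regular expression first, then look it up with match(). This is not what fuzzyjoin does internally. It assumes every code is three capital letters followed by three digits, which holds for the sample data but may not hold for yours. The names code_pattern, extracted and idx are only illustrative, and the data is shortened from the question.

```r
library(data.table)

# Shortened sample data (Ref values as they appear in the question's dt2)
dt1 <- data.table(V1 = c("Apple Pear Orange, AAA111",
                         "Apple DDD444, Pear Orange",
                         "Tomato Cucumber-EEE222",
                         "Seagull Pigeon ZZZ111"))
dt2 <- data.table(Code = c("AAA111", "DDD444", "EEE222"),
                  Ref  = c(1L, 22L, 26L))

# Assumption: a code is three capital letters followed by three digits
code_pattern <- "[A-Z]{3}[0-9]{3}"

# Locate the first code-like token in each string (-1 where there is none)
hits <- regexpr(code_pattern, dt1$V1)

# regmatches() returns only the strings that matched, so fill those slots
extracted <- rep(NA_character_, nrow(dt1))
extracted[hits > 0] <- regmatches(dt1$V1, hits)

# Look each token up in dt2; tokens that are not listed there give NA
idx <- match(extracted, dt2$Code)
dt3 <- data.table(V1 = dt1$V1, Code = dt2$Code[idx], Ref = dt2$Ref[idx])
print(dt3)
```

Unlike the regex join, this only ever considers the first code-like token in a string, so it returns exactly one row per row of dt1 even if a string happens to contain several codes.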

Thanks. Exactly what I was after. – Chris