From string to regex to new string

I have a data frame with a column of messy strings. Each messy string contains the name of a single country somewhere in it. Here is a toy version:

df <- data.frame(string = c("Russia is cool (2015) ", 
          "I like - China", 
          "Stuff happens in North Korea"), 
       stringsAsFactors = FALSE)

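Thanks to the countrycode package, I also have a second data set that includes two useful columns: one with regexes for country names (regex) and another with the associated country name (country.name). We can load this data like so: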

library(countrycode) 
data(countrycode_data)

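I want to write code that uses the regular expressions in countrycode_data$regex to spot country names in each row of df$string; associates that regex with the proper country name in countrycode_data$country.name; and, finally, writes that name to the relevant position in a new column, df$country. After performing this TBD operation, df should look like this: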

                        string                                country
1       Russia is cool (2015)                      Russian Federation
2               I like - China                                  China
3 Stuff happens in North Korea Korea, Democratic People's Republic of

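I can't quite wrap my head around how to do this. I have tried various combinations of grepl, which, tolower, and %in%, but I'm getting the direction or the dimensions (or both) wrong.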

Source

2017-02-14 ulfelder

I don't see a 'regex' column in the 'countrycode_data' data frame... edit: never mind, I think I found it, named 'country.name.en.regex'? – rosscova

The relevant column in 'countrycode_data' is just called 'regex'. The relevant column with the proper names is 'country.name'. – ulfelder

Maybe something like this could help: http://stackoverflow.com/questions/21165256/r-merge-data-frames-allow-inexact-id-matching-eg-with-additional-characters – Bulat

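I would go with a loop in this case, but notably a loop over the rows of the countrycode_data data.frame, since it only has about 200 rows, whereas the real-world original data may be orders of magnitude larger.

Because of the long names, I extract the two columns from the countrycode data: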

patt <- countrycode_data$country.name.en.regex[!is.na(countrycode_data$country.name.en.regex)] 
name <- countrycode_data$country.name.en[!is.na(countrycode_data$country.name.en.regex)]

Then we can loop to write the new column:

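df$country <- NA_character_  # create the column first so the indexed assignment below is well defined
for(i in seq_along(patt)) { 
    df$country[grepl(patt[i], df$string, ignore.case=TRUE, perl=TRUE)] <- name[i] 
}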

As others have pointed out, North Korea does not match the regex specified in the countrycode data.
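
For anyone who wants to try the loop without installing countrycode, here is a minimal sketch of the same idea. The `lookup` table and its three simplified patterns are made up for illustration; they are not the package's real regexes.

```r
# Toy data from the question
df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table standing in for countrycode_data
lookup <- data.frame(
  name  = c("Russian Federation", "China",
            "Korea, Democratic People's Republic of"),
  regex = c("russia", "china", "north.*korea"),
  stringsAsFactors = FALSE
)

# Start with an all-NA column, then fill it one pattern at a time
df$country <- NA_character_
for (i in seq_len(nrow(lookup))) {
  hit <- grepl(lookup$regex[i], df$string, ignore.case = TRUE, perl = TRUE)
  df$country[hit] <- lookup$name[i]
}
```

Note that if a string matches more than one pattern, the last matching pattern overwrites the earlier ones.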

Source

2017-02-14 21:36:01

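Elegant, thanks. (And, as it happens, I actually got the expected result for "North Korea" too.) – ulfelder

Yes, good idea. I was thinking of 'stringi', something like 'which(sapply(countrycode_data$country.name.en.regex, stringi::stri_detect_regex, str = tolower(df$string)), arr.ind = TRUE)' (where 'col' is the row index within 'countrycode_data$country.name.en') –

@DavidArenburg that is a good option too. In the end, you have to do one (and only one) loop somehow. stringi may speed up the regex matching significantly (and it can of course be used with my approach as well) –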

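Here is a working solution, but note that I refer to different column names in the countrycode_data frame, because they show up differently on my system. I have also used a few *apply calls, which may not be ideal. I'm sure some of this could be vectorised, I'm just not sure how myself.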

matches <- sapply(df$string, function(x) { 

    # find matches by running all regex strings (maybe could be vectorised?) 
    find.match <- lapply(countrycode_data$country.name.en.regex, grep, x = x, ignore.case = TRUE, perl = TRUE) 

    # note down which patterns came up with a match 
    matches <- which(sapply(find.match, length) > 0) 

    # now cull the matches list down to only those with a match 
    find.match <- find.match[ sapply(find.match, length) > 0 ] 

    # get rid of NA matches (not sure why these come up) 
    matches <- matches[ sapply(find.match, is.na) == FALSE ] 

    # now only return the value (reference to the match) if there is one (otherwise we get empty returns) 
    ifelse(length(matches) == 0, NA_integer_, matches) 
}) 

# now use the vector of references to match up country names 
df$country <- countrycode_data$country.name.en[ matches ] 

> df 
                        string            country
1       Russia is cool (2015)  Russian Federation
2               I like - China              China
3 Stuff happens in North Korea               <NA>

grepl("^(?=.*democrat|people|north|d.*p.*.r).*\\bkorea|dprk|korea.*(d.*p.*r)", 
     c("korea", "north korea", "aaa north korea"), 
     perl = TRUE, ignore.case = TRUE) 
# [1] FALSE TRUE FALSE
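
My reading of that result, offered as a sketch rather than a statement about how the package should behave: inside the lookahead, the leading `.*` belongs only to the first alternative (`.*democrat`), so `people`, `north` and `d.*p.*.r` have to match right at the start of the string. That would explain why "north korea" matches and "aaa north korea" does not. Wrapping the alternatives in a group makes `.*` apply to all of them. The `grouped` pattern below is my own variation for illustration, not necessarily the fix adopted by countrycode.

```r
x <- c("korea", "north korea", "aaa north korea")

# Pattern quoted above: only `democrat` may appear anywhere in the string
original <- "^(?=.*democrat|people|north|d.*p.*.r).*\\bkorea|dprk|korea.*(d.*p.*r)"

# Hypothetical variant: group the alternatives so `.*` covers each of them
grouped <- "^(?=.*(democrat|people|north|d.*p.*.r)).*\\bkorea|dprk|korea.*(d.*p.*r)"

res_original <- grepl(original, x, perl = TRUE, ignore.case = TRUE)
res_grouped  <- grepl(grouped,  x, perl = TRUE, ignore.case = TRUE)

print(rbind(original = res_original, grouped = res_grouped))
```

With the grouped variant, "aaa north korea" matches as well, while plain "korea" still does not.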

Source

2017-02-14 21:17:24 rosscova

Here is a possible solution using a cross join (which will blow up your data):

library(countrycode) 
data(countrycode_data) 

library(data.table) 
df <- data.table(string = c("Russia is cool (2015) ", 
          "I like - China", 
          "Stuff happens in North Korea"), 
       stringsAsFactors = FALSE) 

# adding dummy for full cross-join merge 
df$dummy <- 0L 
country.dt <- data.table(countrycode_data[, c("country.name.en", "country.name.en.regex")]) 
country.dt$dummy <- 0L 

# merging original data to countries to get all possible combinations 
res.dt <- merge(df, country.dt, by ="dummy", all = TRUE, allow.cartesian = TRUE) 

# there are cases with NA regex 
res.dt <- res.dt[!is.na(country.name.en.regex)] 

# find matches 
res.dt[, match := grepl(country.name.en.regex, string, perl = TRUE, ignore.case = TRUE), by = seq_len(nrow(res.dt))] 

# filter out matches 
res.dt <- res.dt[match == TRUE, .(string, country.name.en)] 
res.dt 

#                    string    country.name.en
# 1: Russia is cool (2015)  Russian Federation
# 2:         I like - China              China

Source

2017-02-14 21:19:23 Bulat

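Why cross join if you end up operating row by row anyway? You could do a simple 'sapply' IMO. –

I agree, in this particular case it is not a great solution because the expected number of matches is small. But it can be used for other similar tasks – Bulat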
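
For comparison, here is a rough sketch of the apply-style alternative mentioned in the comment: build a logical matrix of strings by patterns, then pick the first matching pattern per string. The `lookup` table and its simplified patterns are hypothetical stand-ins for the countrycode data.

```r
strings <- c("Russia is cool (2015) ",
             "I like - China",
             "Stuff happens in North Korea")

# Hypothetical lookup table (not the package's real regexes)
lookup <- data.frame(
  name  = c("Russian Federation", "China",
            "Korea, Democratic People's Republic of"),
  regex = c("russia", "china", "north.*korea"),
  stringsAsFactors = FALSE
)

# One grepl call per pattern; each call is vectorised over all strings
hits <- lapply(lookup$regex, grepl, x = strings,
               ignore.case = TRUE, perl = TRUE)

# Rows are strings, columns are patterns
m <- matrix(unlist(hits), nrow = length(strings))

# Index of the first matching pattern per string (NA if none matches)
idx <- apply(m, 1, function(r) which(r)[1])
country <- lookup$name[idx]
```

This keeps a single loop over the (short) list of patterns and avoids materialising the full cross join as a table.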

This is exactly what the countrycode package is for, so there is no reason to recode it yourself. Use it like this...

library(countrycode) 
df <- data.frame(string = c("Russia is cool (2015) ", "I like - China", 
          "Stuff happens in North Korea"), stringsAsFactors = FALSE) 

df$country.name <- countrycode(df$string, 'country.name', 'country.name')

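In this particular case, it won't find an unambiguous match for "Stuff happens in North Korea", but that is actually a problem with the regexes for North Korea and South Korea (I opened an issue about it here: https://github.com/vincentarelbundock/countrycode/issues/139). Otherwise, what you want to do should work in principle.

(Side note specifically for @ulfelder: a new version of countrycode, v0.19, was just released on CRAN. The column names have changed a bit because we added new languages, so country.name is now country.name.en and regex is now country.name.en.regex.)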

Source

2017-02-19 09:41:46

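I am the countrycode maintainer. @cj-yetman gave the correct answer. The specific North Korea problem you ran into has now been fixed in the development version of countrycode on Github.

You can use countrycode directly to convert sentences into country names or codes: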

> library(devtools) 
> install_github('vincentarelbundock/countrycode') 
> library(countrycode) 
> df <- data.frame(string = c("Russia is cool (2015) ", 
+        "I like - China", 
+        "Stuff happens in North Korea"), 
+     stringsAsFactors = FALSE) 
> df$iso3c = countrycode(df$string, 'country.name', 'country.name') 
> df 
                        string                                 iso3c
1       Russia is cool (2015)                     Russian Federation
2               I like - China                                 China
3 Stuff happens in North Korea Democratic People's Republic of Korea

Source

2017-02-19 13:11:35 Vincent

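Thanks @Vincent! In a way, I'm glad I got the more general answers before the 'countrycode'-specific ones, because this could come up again in situations where no package solves the problem. – ulfelder

Is there an efficient way to use 'countrycode' to catch multiple country names in a single string? For example, if I have the string "Report of the Secretary-General on the Sudan and South Sudan", I would like to return a string like "Sudan; South Sudan". I know how to do the collapsing. It is returning more than one match that has me stumped. – ulfelder

Not out of the box with countrycode, but if you look at the internal code, the package already keeps track of multiple matches. You could use the same code and capture 'destination_list'. See here: https://github.com/vincentarelbundock/countrycode/blob/master/R/countrycode.R#L123 – Vincent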
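
As a rough illustration of the multiple-match idea, independent of the package internals: test every pattern against each string and collapse all matching names. The `lookup` table and its patterns below are hypothetical and much simpler than the real countrycode regexes, and this is not the package's `destination_list` code.

```r
# Hypothetical lookup table; the lookbehind keeps "Sudan" from firing
# on the "Sudan" inside "South Sudan"
lookup <- data.frame(
  name  = c("Sudan", "South Sudan", "China"),
  regex = c("(?<!south )sudan", "south sudan", "china"),
  stringsAsFactors = FALSE
)

strings <- c("Report of the Secretary-General on the Sudan and South Sudan",
             "I like - China",
             "No country here")

all_matches <- vapply(strings, function(s) {
  # TRUE/FALSE for every pattern against this one string
  hit <- vapply(lookup$regex, grepl, logical(1), x = s,
                ignore.case = TRUE, perl = TRUE)
  # Collapse every matching name; NA when nothing matches
  if (any(hit)) paste(lookup$name[hit], collapse = "; ") else NA_character_
}, character(1), USE.NAMES = FALSE)

print(all_matches)
```

The names come out in lookup-table order, not in the order they appear in the string.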
