2017-10-05 76 views

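Find all strings in R that match values from another data frame

I am relatively new to R.

I have a data frame locs with one variable, V1, that looks like this: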

V1 
edmonton general hospital 
cardiovascular institute, hospital san carlos, madrid spain 
hospital of santa maria, lisbon, portugal 

and I have another data frame, cities, with two variables, as shown below:

city    country 
edmonton   canada 
san carlos  spain 
los angeles  united states 
santa maria  united states 
tokyo    japan 
madrid   spain 
santa maria  portugal 
lisbon   portugal 

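I want to create two new variables in locs that hold every string from city that matches anywhere within V1, along with the corresponding country, so that locs looks like this: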

V1                                          city                  country
edmonton general hospital                   edmonton              canada
hospital san carlos, madrid spain           san carlos, madrid    spain
hospital of santa maria, lisbon, portugal   santa maria, lisbon   portugal, united states

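A few things to note: V1 may contain multiple country names. Also, if a country is repeated (for example, san carlos and madrid are both in spain), I only want one instance of that country.

Please advise.

Thanks.

Answer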


A solution using tidyverse and stringr. locs2 is the final output.

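library(tidyverse)   # attaches dplyr and tidyr
library(stringr)     # already attached by recent tidyverse releases; kept for clarity

locs2 <- locs %>% 
    rowwise() %>% 
    # test every distinct city name against V1; non-matches come back as NA
    mutate(city = list(str_extract(V1, unique(cities$city)))) %>% 
    ungroup() %>% 
    unnest(city) %>% 
    drop_na(city) %>% 
    # attach every country listed for each matched city
    left_join(cities, by = "city") %>% 
    group_by(V1) %>% 
    # one row per V1: unique values, sorted, joined with ", "
    summarise(across(everything(), ~ toString(sort(unique(.x)))))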
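
To see what the matching step does for a single row, here is a minimal sketch that tests one string against each city name. The vectors v1 and city_names are illustrative stand-ins for one value of locs$V1 and for the distinct values of cities$city. Using str_detect() with fixed() for a plain substring test is an assumption about what should count as a match, not necessarily what the pipeline above does internally.

```r
library(stringr)

# Illustrative stand-ins for one V1 value and the distinct city names
v1 <- "cardiovascular institute, hospital san carlos, madrid spain"
city_names <- c("edmonton", "san carlos", "los angeles", "santa maria",
                "tokyo", "madrid", "lisbon")

# TRUE wherever the city name occurs in v1 as a literal substring
hits <- str_detect(v1, fixed(city_names))

# Keep only the names that were found
matched <- city_names[hits]
matched
```

For this string the matched names are "san carlos" and "madrid", which is why that row ends up with two cities but only one country.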

Result

locs2 %>% as.data.frame() 
                                                           V1                city                 country
1 cardiovascular institute, hospital san carlos, madrid spain  madrid, san carlos                   spain
2                                   edmonton general hospital            edmonton                  canada
3                   hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
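
If loading the tidyverse is not an option, roughly the same result can be sketched in base R. This is only an illustration under stated assumptions: match_cities is a hypothetical helper name, the data frames are rebuilt inline so the block runs on its own, and matching is a literal substring test via grepl(fixed = TRUE).

```r
# Self-contained copies of the example data (same values as in the post)
locs <- data.frame(
  V1 = c("edmonton general hospital",
         "cardiovascular institute, hospital san carlos, madrid spain",
         "hospital of santa maria, lisbon, portugal"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  city = c("edmonton", "san carlos", "los angeles", "santa maria",
           "tokyo", "madrid", "santa maria", "lisbon"),
  country = c("canada", "spain", "united states", "united states",
              "japan", "spain", "portugal", "portugal"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise the rows of `cities` whose city occurs in `x`
match_cities <- function(x, cities) {
  # one logical per row of `cities`: does that city appear in x?
  hit <- vapply(cities$city, grepl, logical(1), x = x, fixed = TRUE)
  data.frame(
    V1 = x,
    city = toString(sort(unique(cities$city[hit]))),
    country = toString(sort(unique(cities$country[hit]))),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every V1 value and stack the one-row results
locs_base <- do.call(rbind, lapply(locs$V1, match_cities, cities = cities))
locs_base
```

Unlike the pipeline above, this sketch keeps the original row order of locs, and a V1 with no matching city is kept with empty strings rather than dropped.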

DATA

library(tidyverse) 

locs <- tibble(V1 = c("edmonton general hospital", 
        "cardiovascular institute, hospital san carlos, madrid spain", 
        "hospital of santa maria, lisbon, portugal")) 

cities <- read.table(text = "city    country 
edmonton   canada 
'san carlos'  spain 
'los angeles'  'united states' 
'santa maria'  'united states' 
tokyo    japan 
madrid   spain 
'santa maria'  portugal 
lisbon   portugal", 
        header = TRUE, stringsAsFactors = FALSE) 