stringr: extract text based on capitalization and position

I am trying to extract some words (country names) from strings. The strings are elements of a list, for example:

myList <- list(associations = c("Madeup speciesone: \r\n\t\t\t\t", "Foobarae foobar: Russia - 123,", 
           "Foobarus foobar France - 7007,Italy - 7007,Portugal - 6919,Ukraine - 42264,Russia - 7009,", 
           "Foobarus foobarbar", 
           "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,"))

I want to extract the country names. For example, I can do this:

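library(stringr)

# Extract every occurrence of the listed country names from each string
Country_name <- lapply(myList, pattern = "China|France|Italy|Ukraine", str_extract_all) 
country_list <- vector() 
for(i in seq_along(Country_name[[1]])){ 
    # Join the matches for string i into one comma-separated value
    country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",") 
}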

but this only works if I list every possible country, which seems laborious.

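Is there a way to use a regular expression to extract all the country names? Something like starting at the second capitalized word and then extracting every country up to the end of the string?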

Because the species names vary in length (e.g. Foobaria foobariana f. sp. foobaricol), something like lapply(myList, word, 3) does not quite work.
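To illustrate why a fixed word position falls short, here is a minimal sketch using stringr's word() on two of the strings above. The variable names are only for illustration.

```r
library(stringr)

# Two of the example strings: a two-word and a five-word species name
x <- c("Foobarae foobar: Russia - 123,",
       "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,")

# word() splits on single spaces and returns the word at the given position
third <- word(x, 3)

# The first string yields a country, the second a fragment of the species name
print(third)
```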

# desired output 
country_list <- c("","Russia","France,Italy,Portugal,Ukraine,Russia","","Japan,China")

Source

2016-10-10 nofunsally

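Please check whether the 'myList' object is what you intended. The original post had no 'list' part; I edited on the assumption that it was needed. – nicola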

With the edited myList, you can try: 'lapply(str_extract_all(myList$associations, "(?!^)[A-Z]\\w+"), paste, collapse = ",")'. – nicola

@nicola The myList edit is what I intended. Your code works. '\\w+' is a word boundary, right? – nofunsally
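The regular expression from the comment above can be tried out as follows. This is a minimal sketch that assumes the strings are held in a plain character vector (named associations here for illustration) and uses vapply() to return a character vector rather than a list. On the question about '\\w+': it matches one or more word characters (letters, digits, underscore); the word boundary is '\\b'.

```r
library(stringr)

associations <- c(
  "Madeup speciesone: \r\n\t\t\t\t",
  "Foobarae foobar: Russia - 123,",
  "Foobarus foobar France - 7007,Italy - 7007,Portugal - 6919,Ukraine - 42264,Russia - 7009,",
  "Foobarus foobarbar",
  "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,"
)

# (?!^)  : the match may not start at the very beginning of the string,
#          which skips the capitalized genus name
# [A-Z]  : one capital letter
# \\w+   : one or more word characters
matches <- str_extract_all(associations, "(?!^)[A-Z]\\w+")

# Collapse the matches of each string into one comma-separated value;
# strings without a match become ""
country_list <- vapply(matches, paste, character(1), collapse = ",")
print(country_list)
```

Note that this pattern relies on capitalization alone, so a multi-word name such as "United States" would come back as two separate matches.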

You can use the countrycode package

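library(countrycode) 
# Note (assumption): newer countrycode releases appear to expose this table
# as `codelist` (column `country.name.en`) rather than `countrycode_data`.
countries <- as.data.frame(countrycode_data$country.name)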

to get the country names. If you want to stick with your code, you can build a single string that contains all country names separated by "|":

all <- paste(countrycode_data$country.name, collapse="|")

Then run

Country_name <- lapply(myList, pattern = all, str_extract_all) 

country_list <- vector() 
for(i in 1:length(Country_name[[1]])){ 
country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",") 
}

This should give you the following result:

myList <- list(associations = c("Madeup speciesone: \r\n\t\t\t\t", "Foobarae foobar: Russia - 123,", 
          "Foobarus foobar France - 7007,Italy - 7007,Portugal - 6919,Ukraine - 42264,Russia - 7009,", 
          "Foobarus foobarbar", 
          "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,", 
          "Germany", 
          "555Senegal")) 

Country_name <- lapply(myList, pattern = all, str_extract_all) 

country_list <- vector() 

for(i in 1:length(Country_name[[1]])){ 
country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",") 
} 

country_list 
[1] ""   ""    "France,Italy,Portugal,Ukraine" 
[4] ""   "Japan,China"  "Germany"      
[7] "Senegal"
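Russia does not appear in the output above. One plausible explanation is that the name table stores it under a longer form, and an alternation pattern only matches names exactly as they are spelled in the table. The sketch below illustrates this with a small hand-written vector; known_names is a hypothetical stand-in, not the package's actual data, and vapply() replaces the loop.

```r
library(stringr)

# Hypothetical stand-in for the package's name table (an assumption)
known_names <- c("France", "Italy", "Portugal", "Ukraine", "Japan", "China",
                 "Germany", "Senegal", "Russian Federation")

# Build one alternation pattern: "France|Italy|..."
pattern <- paste(known_names, collapse = "|")

associations <- c(
  "Foobarae foobar: Russia - 123,",
  "Foobarus foobar France - 7007,Italy - 7007,Russia - 7009,",
  "555Senegal"
)

found <- str_extract_all(associations, pattern)

# "Russia" is not matched because only "Russian Federation" is in the vector
country_list <- vapply(found, paste, character(1), collapse = ",")
print(country_list)
```

If short forms matter, they would have to be added to the name vector by hand.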

Source

2016-10-10 17:34:44 rfsrc
