R - How to replace strings from multiple matches (in a data frame)

I need to replace parts of a string using matches that are stored in a data frame.

For example:

input_string = "Whats your name and Where're you from"

I need to replace parts of this string using a data frame. Say the data frame is:

matching <- data.frame(from_word=c("Whats your name", "name", "fro"), 
      to_word=c("what is your name","names","froth"))

The expected output is: what is your name and Where're you from

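Notes:

It must match the largest string. In this example, "name" is not replaced with "names", because "name" is part of a larger match.
It must match the entire word and not part of a word. The "fro" in "from" should not be replaced with "froth".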
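
The whole-word requirement in the second note can be illustrated with a word-boundary pattern in base R. This is only a minimal sketch of the rule, not a full solution; the second test string is made up for illustration.

```r
# "\\b" marks a word boundary, so "fro" inside "from" is not a whole-word match
unchanged <- gsub("\\bfro\\b", "froth", "Where're you from")
unchanged
# [1] "Where're you from"

# a standalone "fro" is replaced
changed <- gsub("\\bfro\\b", "froth", "to and fro")
changed
# [1] "to and froth"
```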

I referred to the link below, but somehow could not get it to work as intended/described above.

Match and replace multiple strings in a vector of text without looping in R

This is my first post here. If I have not given enough detail, please let me know.

Source

2017-03-24 Sri

Edit

Based on the input from Sri's comments, I would suggest using:

library(gsubfn) 
# words to be replaced 
a <-c("Whats your","Whats your name", "name", "fro") 
# their replacements 
b <- c("What is yours","what is your name","names","froth") 
# named list as an input for gsubfn 
replacements <- setNames(as.list(b), a) 
# the test string 
input_string = "fro Whats your name and Where're name you from to and fro I Whats your" 
# match entire words 
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
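
To see why padding each pattern with \\w* helps, it can be useful to look at what the padded pattern actually captures. The base R sketch below (the test string is made up) shows that the "fro" in "from" is captured as the whole word "from". As I understand gsubfn's list lookup, a match whose text is not a name in the replacement list is left unchanged, which would explain why "from" survives while a standalone "fro" is replaced.

```r
x <- "you from to and fro"
# the padded pattern swallows the whole word around "fro"
hits <- regmatches(x, gregexpr("\\w*fro\\w*", x))[[1]]
hits
# [1] "from" "fro"
```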

Original

I wouldn't say this is easier to read than your simple loop, but it might take better care of overlapping replacements:

# define the sample dataset 
input_string = "Whats your name and Where're you from" 
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"), 
         to_word=c("what is your name","names","froth", "where are", "Whatsup")) 

# load used library 
library(gsubfn) 

# make sure data is of class character 
matching$from_word <- as.character(matching$from_word) 
matching$to_word <- as.character(matching$to_word) 

# extract the words in the sentence (base R strsplit; str_split would need library(stringr)) 
test <- unlist(strsplit(input_string, " ", fixed = TRUE)) 
# find where individual words from the sentence match the list of replaceable words 
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word) 
# change rownames to see what is the format of output from the above sapply 
rownames(test2) <- matching$from_word 
# reorder the data so that largest replacement blocks are at the top 
test3 <- test2[order(rowSums(test2), decreasing = TRUE),] 
# where the word is already being replaced by larger chunk, do not replace again 
test3[apply(test3, 2, cumsum) > 1] <- FALSE 

# define the actual pairs of replacement 
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]), 
         as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]) 

# perform the replacement 
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"), 
     replacements,input_string)

Source

2017-03-24 14:21:57 ira

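Thanks @ira. I noticed two things in your code: 1. I tested it with a matching data frame of more than 4,500 rows. My loop approach ran in 0.2 seconds, while the code above took 0.4 seconds. 2. I think your code expects the strings to be in the same order as in a. For example, if I set the input string to "fro name", your code replaces "fro" with "froth" but does not replace "name" with "names". I think this is because of the ordering? I am not sure. – Sri

@Sri The error occurred because the code did not check whether the whole pattern matched or only part of it. I have now come up with a more elegant approach, in line with Aleksandr Voitov's answer, which should also fix the problem in his answer. – ira

Thanks @ira. It works. I must add that it works when there are only a few rows of "from" and "to" replacements. When I tried it with 60,000 replacement rows, it failed with an error that the regular expression could not be compiled. So for now I will keep using the loop from my own answer, until someone comes up with something better (by removing the loop). – Sri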

toreplace =list("x1" = "y1","x2" = "y2", ..., "xn" = "yn")

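The function takes two kinds of arguments, xi and yi.

xi is the pattern (what to find),
yi is the replacement (what to replace it with).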

input_string = "Whats your name and Where're you from" 
toreplace <- list("Whats your name" = "what is your name", "name" = "names", "fro" = "froth") 
gsubfn(paste(names(toreplace),collapse="|"),toreplace,input_string)

Source

2017-03-24 12:17:58

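Thanks, Aleksandr Voitov, for your response. I think the "fro" in "from" is also being replaced. This is the answer I get: what is your name and Where're you **froth** – Sri

No problem, @Sri. It is a very useful function: it saves you from calling gsub() multiple times when you want to replace specific strings with something meaningful. –

Oh, sorry, my comment may not have been clear. In your code, the **from** at the end should not become **froth**. Is there a way to prevent that? – Sri

I was trying different things, and the code below seems to work.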

a <-c("Whats your name", "name", "fro") 
b <- c("what is your name","names","froth") 
c <- c("Whats your name and Where're you from") 

for(i in seq_along(a)) c <- gsub(paste0('\\<',a[i],'\\>'), gsub(" ","_",b[i]), c) 
c <- gsub("_"," ",c) 
c

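I took help from the link below: Making gsub only replace entire words?

However, I would like to avoid the loop if possible. Can someone improve this answer so that it does not use a loop?

Source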

2017-03-24 14:02:30 Sri
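
One possible loop-free variant in base R, offered only as a sketch and not as any poster's method: build a single alternation with the longest patterns first, find all whole-word matches in one pass, and look each match up in the replacement vector. The function name replace_whole_words is hypothetical. The sketch assumes every pattern starts and ends with a word character (so that \\b applies) and does not contain the literal sequence \E. It has not been tried on a 60,000-row table, where the combined regular expression may still be too large to compile.

```r
# Hypothetical helper: replace whole-word matches of `from` with `to` in one pass
replace_whole_words <- function(x, from, to) {
  from <- as.character(from)   # guard against factor columns
  to   <- as.character(to)
  # longest patterns first: PCRE takes the first alternative that matches
  ord  <- order(nchar(from), decreasing = TRUE)
  from <- from[ord]
  to   <- to[ord]
  # \Q...\E quotes each pattern literally; \b requires whole-word matches
  pattern <- paste0("\\b(?:", paste0("\\Q", from, "\\E", collapse = "|"), ")\\b")
  m <- gregexpr(pattern, x, perl = TRUE)
  # look up every matched string in `from` and substitute the paired `to`
  regmatches(x, m) <- lapply(regmatches(x, m), function(hit) to[match(hit, from)])
  x
}

matching <- data.frame(from_word = c("Whats your name", "name", "fro"),
                       to_word   = c("what is your name", "names", "froth"))
replace_whole_words("Whats your name and Where're you from",
                    matching$from_word, matching$to_word)
# [1] "what is your name and Where're you from"
```

Because the input is scanned once and replacement text is never rescanned, the "name" inside "what is your name" is not turned into "names", so the underscore trick from the loop version is not needed here.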
