Edit
Based on the input from Sri's comment, I would suggest using:
library(gsubfn)
# words to be replaced
a <- c("Whats your", "Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
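To make the lookup step less of a black box, here is a rough base-R sketch of the same idea. It is not how gsubfn is implemented; `lookup`, `keys`, `hits`, `new_values` and `result` are names made up for this illustration, and sorting the keys longest-first is my own assumption, so that the PCRE alternation tries "Whats your name" before "Whats your".

```r
# Illustrative sketch only: whole-word lookup replacement in base R
a <- c("Whats your", "Whats your name", "name", "fro")
b <- c("What is yours", "what is your name", "names", "froth")
lookup <- setNames(b, a)

# assumption: try longer keys first, since PCRE takes the first alternative that matches
keys <- names(lookup)[order(nchar(names(lookup)), decreasing = TRUE)]
pattern <- paste0("\\w*(?:", paste(keys, collapse = "|"), ")\\w*")

input_string <- "fro Whats your name and Where're name you from to and fro I Whats your"

# locate every candidate match and pull out the matched text
m <- gregexpr(pattern, input_string, perl = TRUE)
hits <- regmatches(input_string, m)[[1]]

# replace only matches that are exactly a key; "from" is matched but left as it is
new_values <- ifelse(hits %in% names(lookup), lookup[hits], hits)

result <- input_string
regmatches(result, m) <- list(unname(new_values))
result
```

The surrounding `\\w*` makes each match grow to the whole word, so a partial hit such as "fro" inside "from" is looked up as "from", is not found among the keys, and is therefore left untouched.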
Original
I wouldn't say this is easier to read than your simple loop, but it might take better care of overlapping replacements:
# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
to_word=c("what is your name","names","froth", "where are", "Whatsup"))
# load used library
library(gsubfn)
# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)
# extract the words in the sentence (base R strsplit, so no extra package is needed)
test <- unlist(strsplit(input_string, " ", fixed = TRUE))
# find where individual words from the sentence match the list of replaceable words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE
# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])
# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
replacements,input_string)
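The masking step with `cumsum` is the least obvious part, so here is a small toy example of just that step. The matrix `m` is made up for the illustration (rows are candidate phrases already sorted largest first, columns are words of the sentence); it is not taken from the data above.

```r
# Toy illustration of the cumsum masking step (hypothetical matrix, not the real data)
m <- rbind("Whats your name" = c(TRUE, TRUE, TRUE),
           "Whats"           = c(TRUE, FALSE, FALSE),
           "name"            = c(FALSE, FALSE, TRUE))
colnames(m) <- c("Whats", "your", "name")

# running count, per sentence word, of how many phrases have claimed it so far
claimed <- apply(m, 2, cumsum)

# a word already claimed by a larger phrase above is released from the smaller ones
m[claimed > 1] <- FALSE

# phrases that still own at least one word are the ones to replace
kept <- names(which(rowSums(m) >= 1))
kept
```

Because "Whats your name" sits in the first row and covers all three words, the shorter phrases "Whats" and "name" lose their claims and drop out of the replacement list.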
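Thanks @ira. I noticed two things in your code: 1. I tested with a matching data frame that has more than 4,500 rows. My loop approach executes in 0.2 seconds, while the code above takes 0.4 seconds. And 2. I think your code expects the strings to appear in the same order as in a. For example, if I set the input string to "fro name", your code replaces fro with froth but does not replace name with names. I think this is because of the sorting? I am not sure. – Sri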
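@Sri The error occurred because the code did not check whether the whole pattern was matched or only part of it. I have now come up with a more elegant way, in line with Alekandr Voitov's answer, which should also fix the problem in his answer. – ira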
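Thanks @ira. It works. I must add that it works when the number of "from" and "to" replacement rows is small. But when I tried it with 60,000 replacement rows, it failed with an error because the regular expression could not be compiled. So for now I will keep using my own loop solution until someone comes up with something better (by removing the loop). – Sri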