Fuzzy match a row in one column with the same row in the next column

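I want to look up information in one column based on another column. One column holds a few words (a name) and the other holds full sentences. I want to know whether the words can be found in the sentence of the same row. Sometimes the words are not written identically, so I cannot use the SQL LIKE function. I therefore think fuzzy matching combined with some form of "like" would help. The data looks like this: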

Names           Sentences
Airplanes Sarl  Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.       Cola Inc. is 50% share of PopsiCo which is part of LaLo.

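The data has about 2000 rows. I need logic that decides whether Airplanes Sarl really is in the sentence or not, and that also works for Kidco Ltd., which appears in the sentence as 'Kidco.Ltd'.

For simplicity, I do not need to search every sentence in the column. I only need to take Kidco Ltd. and search for it in the same row of the data frame.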

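I have already tried this in Python with: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)

But I got a lot of unicode/ascii errors, so I gave up and would like to try it in R. Any suggestions on how to do this in R? I have seen answers on Stack Overflow that fuzzy-match against all sentences in a column, which is different from what I want.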
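For reference, the row-wise comparison attempted above can be sketched in Python 3 using only the standard library, which sidesteps most unicode/ascii trouble because Python 3 strings are Unicode by default. This is a minimal sketch under stated assumptions, not the fuzzywuzzy call from the question: `difflib.SequenceMatcher` is swapped in for `fuzz.ratio`, and the helper names `normalize` and `best_window_ratio` are hypothetical. Accents and punctuation are stripped first, then the name is compared against every sentence window of the same length.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, decompose accented letters, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed.lower() if ch.isalnum())


def best_window_ratio(name, sentence):
    """Best similarity (0.0 to 1.0) between the name and any same-length window of the sentence."""
    needle = normalize(name)
    haystack = normalize(sentence)
    if not needle or not haystack:
        return 0.0  # nothing to compare, e.g. an empty sentence
    if len(haystack) <= len(needle):
        return SequenceMatcher(None, needle, haystack).ratio()
    width = len(needle)
    return max(
        SequenceMatcher(None, needle, haystack[i:i + width]).ratio()
        for i in range(len(haystack) - width + 1)
    )


# Each name is compared only with the sentence in its own row.
rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
]
for name, sentence in rows:
    print(name, round(best_window_ratio(name, sentence), 2))
```

A threshold on the returned ratio (for example 0.8, an arbitrary choice) would turn the score into a yes/no match. Removing punctuation and spaces before comparing is a design choice that makes 'Kidco Ltd.' and 'Kidco.Ltd.' identical, at the cost of ignoring word boundaries.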


2017-05-29 Probs

Can you point us to the answer that fuzzy-matches everything? –

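Since your table is small, you could try the Levenshtein distance. Say d is the distance, n1 is the number of characters in col1, and n2 is the number of characters in col2. If the name is not in the sentence at all, the distance should be close to n2; if it is in the sentence, the distance should be n2-n1. You would then define a cutoff, and I think it could work well. –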
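The arithmetic in this comment can be illustrated with a small sketch. It is written in Python with a hand-rolled Levenshtein function so that it runs without extra packages; the names `levenshtein`, `match_quality`, and `CUTOFF`, as well as the cutoff value of 2, are assumptions made for the illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def match_quality(name, sentence):
    """Distance d in excess of the unavoidable n2 - n1 insertions; 0 is the best score."""
    d = levenshtein(name, sentence)
    n1, n2 = len(name), len(sentence)
    return d - (n2 - n1)


CUTOFF = 2  # arbitrary illustrative threshold

rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("some company", "It is a truth universally acknowledged..."),
]
for name, sentence in rows:
    quality = match_quality(name, sentence)
    print(name, quality, quality <= CUTOFF)
```

One caveat with this score: it reaches 0 whenever the characters of the name occur in the sentence in the right order, even if they are scattered far apart, so long sentences can produce false positives.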

Maybe try tokenization + phonetic matching:

library(RecordLinkage) 
library(quanteda) 
df <- read.table(header=T, sep=";", text=" 
Names     ;Sentences 
Airplanes Sarl   ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    ;100% ownership of Kidco.Ltd. is the mother company. 
Popsi Co.    ;Cola Inc. is 50% share of PopsiCo which is part of LaLo. 
Popsi Co.    ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.") 
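f <- soundex # phonetic encoder from RecordLinkage
# tokenize() is the quanteda API as of this answer (2017); newer quanteda releases provide tokens() and tokens_ngrams() instead
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co" 
tokens <- lapply(tokens, f) # Soundex code of every token in each sentence
mapply(is.element, soundex(df$Names), tokens) # is the name's code among the codes of its own row?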
# A614 K324 K324 P122 P122 
# TRUE FALSE TRUE TRUE TRUE


2017-05-29 15:03:20 lukeA

Here is a solution using the approach I suggested in the comments. It works well on this example:

library("stringdist") 

df <- as.data.frame(matrix(c("Airplanes Sarl","Airplanes-Sàrl is part of Airplanes-Group Sarl.", 
          "Kidco Ltd.","100% ownership of Kidco.Ltd. is the mother company.", 
          "Popsi Co.","Cola Inc. is 50% share of PopsiCo which is part of LaLo.", 
          "some company","It is a truth universally acknowledged...", 
          "Hello world",list(NULL)), 
        ncol=2,byrow=TRUE,dimnames=list(NULL,c("Names","Sentences"))),stringsAsFactors=FALSE) 

null_elements <- which(sapply(df$Sentences,is.null)) 
df$Sentences[null_elements] <- "" # replacing NULLs to avoid errors 
df$dist <- mapply(stringdist,df$Names,df$Sentences) 
df$n2 <- nchar(df$Sentences) 
df$n1 <- nchar(df$Names) 
df$match_quality <- df$dist-(df$n2-df$n1) 
cutoff <- 2 
df$match <- df$match_quality <= cutoff 
df$Sentences[null_elements] <- list(NULL) # setting null elements back to initial value 
df$match[null_elements] <- NA # optional, set to FALSE otherwise, as it will prevent some false positives if Names is shorter than cutoff 

#   Names          Sentences                                                dist n2 n1 match_quality match
# 1 Airplanes Sarl Airplanes-Sàrl is part of Airplanes-Group Sarl.            33 47 14             0  TRUE
# 2 Kidco Ltd.     100% ownership of Kidco.Ltd. is the mother company.        42 51 10             1  TRUE
# 3 Popsi Co.      Cola Inc. is 50% share of PopsiCo which is part of LaLo.   48 56  9             1  TRUE
# 4 some company   It is a truth universally acknowledged...                  36 41 12             7 FALSE
# 5 Hello world    NULL                                                       11  0 11            22    NA


2017-05-29 15:52:28

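Moody_Mudskipper, really nice answer! However, if the data in 'Sentences' is NULL, it reports a TRUE match. You can try it with the example you provided: put anything in 'Names' and leave 'Sentences' empty. – Probs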

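I think it should work properly now. I did not get a TRUE match in my case, but I did get an error when the sentence was NULL. Please let me know whether it works. –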
