String matching with R: finding the best match

I have two vectors of words.

Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo') 

Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

I need to find the best match between the lexicon and the corpus. I have tried many approaches. This is one of them.

library(stringr) 

match<- paste(Lexicon,collapse= '|^') # I use the stemming method (snowball), so the words in Lexicon are root of words 

test<- str_extract_all(Corpus,match,simplify= T) 

test 

[,1] 
[1,] "animal" 
[2,] "fe" 
[3,] "fe" 
[4,] "ladr"

However, the match should be:

[1,] "animalada" 
[2,] "fe" 
[3,] "fernandez" 
[4,] "ladrillo"

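Instead, each word is matched with the first entry of my lexicon, in alphabetical order, that fits. By the way, these vectors are a sample of a much larger list that I have.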
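The behaviour described above can be reproduced on a single word. This is a minimal sketch in base R, assuming `perl = TRUE` as a stand-in for the regex engine behind stringr; both try the alternatives of `|` from left to right and stop at the first one that succeeds. The variable names are illustrative only.

```r
word <- "animalada"

# Shorter alternative listed first: the engine accepts "animal" and stops.
short_first <- regmatches(word, regexpr("^animal|^animalada", word, perl = TRUE))

# Longer alternative listed first: the whole word is returned.
long_first <- regmatches(word, regexpr("^animalada|^animal", word, perl = TRUE))

short_first  # "animal"
long_first   # "animalada"
```

So the order of the alternatives in the pattern, not the quality of the match, decides which lexicon entry is returned.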

I haven't tried using regex() because I'm not sure how it works. Maybe the solution lies there.

Can you help me solve this problem? Thank you for your help.

2017-09-23 pch919

You can order the patterns in Lexicon by number of characters, in decreasing order, so that the best match comes first:

match<- paste(Lexicon[order(-nchar(Lexicon))], collapse = '|^') 

test<- str_extract_all(Corpus, match, simplify= T) 

test 
#  [,1]  
#[1,] "animalada" 
#[2,] "fe"  
#[3,] "fernandez" 
#[4,] "ladrillo"
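One detail of the pattern above: `paste(..., collapse = '|^')` puts `^` in front of every alternative except the first, so the first one can match anywhere in a word. The following is a hedged sketch of a variant that anchors all alternatives at once. It uses only base R, assumes the lexicon entries contain no regex metacharacters, and adds the words `tambien` and `bien` (taken from later in this thread) to show a non-match. The names `by_length`, `pattern`, `found` and `best` are made up for the example.

```r
Corpus  <- c("animalada", "fe", "fernandez", "ladrillo", "tambien")
Lexicon <- c("animal", "animalada", "fe", "fernandez", "ladr", "ladrillo", "bien")

# Longest entries first, so the longest candidate is tried before its prefixes.
by_length <- Lexicon[order(-nchar(Lexicon))]

# A single "^" in front of a non-capturing group anchors every alternative.
pattern <- paste0("^(?:", paste(by_length, collapse = "|"), ")")

# regexpr() returns -1 where nothing matched; keep the result aligned with Corpus.
hits  <- regexpr(pattern, Corpus, perl = TRUE)
found <- as.integer(hits) != -1L
best  <- rep(NA_character_, length(Corpus))
best[found] <- regmatches(Corpus, hits)

best
# [1] "animalada" "fe"        "fernandez" "ladrillo"  NA
```

Words with no lexicon entry as a prefix come back as `NA` instead of being dropped, which makes it easier to join the result back to the corpus.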

2017-09-23 01:54:24 Psidom

I am testing your answers with the real Lexicon. I will report the results later. Thank you both – pch919

You can simply use the match function.

Index <- match(Corpus, Lexicon) 

Index 
[1] 2 3 4 6 

Lexicon[Index] 
[1] "animalada" "fe" "fernandez" "ladrillo"

2017-09-23 01:59:20 Santosh

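I tried both methods, and the correct one is the one suggested by @Psidom. If you use the match() function, it finds a match in any part of the word, not necessarily at the beginning. For example: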

Corpus<- c('tambien') 
Lexicon<- c('bien') 
match(Corpus,Lexicon)

The result is 'tambien', but this is not correct.

Thanks again for your help!
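For comparison, here is a small check of how these two words behave in a fresh session. It is a sketch that assumes base R's `match()` is the function being called (nothing masking it), and it is only a guess that the substring hit came from an unanchored pattern: with a single lexicon entry, `paste(Lexicon, collapse = '|^')` adds no `^` at all. The variable names are illustrative.

```r
Corpus  <- c("tambien")
Lexicon <- c("bien")

# base::match() compares whole elements, so "tambien" vs "bien" gives NA.
exact <- base::match(Corpus, Lexicon)

# With one entry there is nothing to collapse: the pattern is just "bien".
unanchored <- paste(Lexicon, collapse = "|^")

# An unanchored pattern matches inside the word.
inside <- grepl(unanchored, Corpus)

# A prefix test rejects it, which is the behaviour wanted here.
prefix <- startsWith(Corpus, Lexicon)

exact       # NA
unanchored  # "bien"
inside      # TRUE
prefix      # FALSE
```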

2017-09-27 03:16:36 pch919
