Fuzzy comparison between two Hive columns using Apache Spark and Scala

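I am reading data from 2 Hive tables. The token table holds the tokens that need to be matched against the input data. The input data contains a description column along with other columns. I need to split the input data and compare each split element against all the elements in the token table. At the moment I am using the me.xdrop.fuzzywuzzy.FuzzySearch library for fuzzy matching.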

Below is my code snippet:

val tokens = sqlContext.sql("select token from tokens")
val desc = sqlContext.sql("select description from desceriptiontable")
// Read the column value directly; Row.toString would wrap the text in [ ]
val desc_tokens = desc.flatMap(_.getString(0).split(" "))

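Now I need to iterate over desc_tokens, and each element of desc_tokens should be fuzzy-matched against every element of tokens. If the match exceeds 85%, I need to replace the desc_tokens element with the matching element from tokens.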

Example:

My token list is:

hello 
this 
is 
token 
file 
sample

and my input description is:

helo this is input desc sampl

The code should return:

hello this is input desc sample

Since hello and helo fuzzy-match at more than 85%, helo is replaced with hello. The same applies to sampl, which becomes sample.
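
The rule described above can be sketched in plain Scala, without Spark. This is only a minimal sketch under assumptions: `ratio` is a hand-written stand-in (2 × longest-common-subsequence length over the summed lengths, scaled to 0..100), not a call into the FuzzySearch library, and `FuzzyReplace`, `bestMatch` and `fixDescription` are hypothetical names introduced for illustration.

```scala
object FuzzyReplace {
  // Length of the longest common subsequence of a and b (dynamic programming).
  def lcsLength(a: String, b: String): Int = {
    val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
    for (i <- 1 to a.length; j <- 1 to b.length) {
      dp(i)(j) =
        if (a(i - 1) == b(j - 1)) dp(i - 1)(j - 1) + 1
        else math.max(dp(i - 1)(j), dp(i)(j - 1))
    }
    dp(a.length)(b.length)
  }

  // Similarity in 0..100. Hand-written stand-in, NOT a library function.
  def ratio(a: String, b: String): Int =
    if (a.isEmpty && b.isEmpty) 100
    else math.round(200.0 * lcsLength(a, b) / (a.length + b.length)).toInt

  // Return the best-scoring token if it beats the threshold, else the word itself.
  def bestMatch(word: String, tokens: Seq[String], threshold: Int = 85): String =
    if (tokens.isEmpty) word
    else {
      val best = tokens.maxBy(t => ratio(word, t))
      if (ratio(word, best) > threshold) best else word
    }

  // Split on spaces, replace each word independently, and join again.
  def fixDescription(desc: String, tokens: Seq[String]): String =
    desc.split(" ").map(bestMatch(_, tokens)).mkString(" ")
}
```

With the token list from the example, `FuzzyReplace.fixDescription("helo this is input desc sampl", tokens)` gives `hello this is input desc sample`: helo scores 89 against hello and sampl scores 91 against sample, while input and desc stay well below the threshold against every token.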

Source

2017-06-28 shashank kulkarni

I ran a test with this library: https://github.com/rockymadden/stringmetric

Another idea (not optimized):

import com.rockymadden.stringmetric.similarity.JaroMetric

// The token order is deliberately different from the question
val tokens = Array("this", "is", "sample", "token", "file", "hello")
val desc_tokens = Array("helo", "this", "is", "token", "file", "sampl")

val res = desc_tokens.map { str =>
  // Score str against every token; compare returns an Option, so treat "no score" as 0.0
  val elem = tokens.zipWithIndex.map { case (tok, index) =>
    (tok, index, JaroMetric.compare(str, tok).getOrElse(0.0))
  }
  // Pick the token with the highest score
  val emax = elem.maxBy { case (_, _, score) => score }
  // If the best score is above 0.85, use that token; otherwise keep the input word
  if (emax._3 > 0.85) tokens(emax._2) else str
}
res.foreach(println)

My output: hello this is token file sample
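
To apply the same idea to the Hive tables from the question, one possible wiring is to collect the tokens on the driver, broadcast them, and rewrite the description column with a UDF. This is a rough sketch, not a tested job. It assumes the token table is small enough to collect, that `FuzzySearch.ratio` from me.xdrop.fuzzywuzzy (the library named in the question) is on the classpath, and that the table and column names are exactly as written in the question. `SparkFuzzyReplace`, `fix` and `run` are hypothetical names.

```scala
import me.xdrop.fuzzywuzzy.FuzzySearch
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, udf}

object SparkFuzzyReplace {
  // Pure helper: replace each word by its best-scoring token when the
  // score (0..100) is above the threshold; otherwise keep the word.
  def fix(desc: String, tokens: Array[String], threshold: Int = 85): String =
    if (desc == null || tokens.isEmpty) desc
    else desc.split(" ").map { word =>
      val best = tokens.maxBy(t => FuzzySearch.ratio(word, t))
      if (FuzzySearch.ratio(word, best) > threshold) best else word
    }.mkString(" ")

  def run(sqlContext: SQLContext): DataFrame = {
    // Assumption: the token table fits in driver memory.
    val tokenArray = sqlContext.sql("select token from tokens")
      .collect()
      .filter(!_.isNullAt(0))
      .map(_.getString(0))

    // Ship one read-only copy of the tokens to each executor.
    val tokensBc = sqlContext.sparkContext.broadcast(tokenArray)
    val fixUdf = udf((desc: String) => fix(desc, tokensBc.value))

    // Table name copied from the question as written.
    sqlContext.sql("select description from desceriptiontable")
      .withColumn("description", fixUdf(col("description")))
  }
}
```

Broadcasting avoids a cross join between the two tables, at the cost of comparing every word with every token inside the UDF. For a large token table that trade-off would need rethinking.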

Source

2017-06-28 14:17:26 Jeremy

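Thanks for the reply, @Jeremy. zipWithIndex iterates index-wise, so if the token hello sits at index 2 or 3, this code will not work. What I am looking for is that each token from the input description is looked up against all the tokens in the token list, and the best match or the first match (> 85%) from the token list is returned.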
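
For what it is worth, zipWithIndex only attaches each token's position so that it can be looked up afterwards; maxBy still scans the whole list, so the position of hello should not matter. The two strategies mentioned in the comment (best match and first match above the threshold) can be sketched as below. This is an illustration under assumptions: `score` is a toy stand-in that counts shared characters as a multiset and ignores their order, not a real string metric, and `MatchStrategies`, `bestMatch` and `firstMatch` are hypothetical names.

```scala
object MatchStrategies {
  // Toy scorer in [0, 1]: shared characters (as a multiset) over total length.
  // A stand-in for illustration only; swap in a real metric in practice.
  def score(a: String, b: String): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else 2.0 * a.toSeq.intersect(b.toSeq).length / (a.length + b.length)

  // Highest-scoring token above the threshold, wherever it sits in the list.
  def bestMatch(word: String, tokens: Seq[String], threshold: Double = 0.85): Option[String] = {
    val scored = tokens.map(t => (t, score(word, t))).filter(_._2 > threshold)
    if (scored.isEmpty) None else Some(scored.maxBy(_._2)._1)
  }

  // First token, in list order, whose score is above the threshold.
  def firstMatch(word: String, tokens: Seq[String], threshold: Double = 0.85): Option[String] =
    tokens.find(t => score(word, t) > threshold)
}
```

Both functions look at every candidate they need to, so reordering the token list changes the result only when several tokens clear the threshold and firstMatch is used.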
