優化Jaro-Winkler算法

我從this網站獲取Jaro-Winkler算法的代碼。我需要運行150,000次以獲得差異之間的距離。這需要很長時間，因爲我在Android移動設備上運行。優化Jaro-Winkler算法

它可以優化更多嗎？

public class Jaro { 
    /** 
    * gets the similarity of the two strings using Jaro distance. 
    * 
    * @param string1 the first input string 
    * @param string2 the second input string 
    * @return a value between 0-1 of the similarity 
    */ 
    public float getSimilarity(final String string1, final String string2) { 

     //get half the length of the string rounded up - (this is the distance used for acceptable transpositions) 
     final int halflen = ((Math.min(string1.length(), string2.length()))/2) + ((Math.min(string1.length(), string2.length())) % 2); 

     //get common characters 
     final StringBuffer common1 = getCommonCharacters(string1, string2, halflen); 
     final StringBuffer common2 = getCommonCharacters(string2, string1, halflen); 

     //check for zero in common 
     if (common1.length() == 0 || common2.length() == 0) { 
      return 0.0f; 
     } 

     //check for same length common strings returning 0.0f is not the same 
     if (common1.length() != common2.length()) { 
      return 0.0f; 
     } 

     //get the number of transpositions 
     int transpositions = 0; 
     int n=common1.length(); 
     for (int i = 0; i < n; i++) { 
      if (common1.charAt(i) != common2.charAt(i)) 
       transpositions++; 
     } 
     transpositions /= 2.0f; 

     //calculate jaro metric 
     return (common1.length()/((float) string1.length()) + 
       common2.length()/((float) string2.length()) + 
       (common1.length() - transpositions)/((float) common1.length()))/3.0f; 
    } 

    /** 
    * returns a string buffer of characters from string1 within string2 if they are of a given 
    * distance seperation from the position in string1. 
    * 
    * @param string1 
    * @param string2 
    * @param distanceSep 
    * @return a string buffer of characters from string1 within string2 if they are of a given 
    *   distance seperation from the position in string1 
    */ 
    private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) { 
     //create a return buffer of characters 
     final StringBuffer returnCommons = new StringBuffer(); 
     //create a copy of string2 for processing 
     final StringBuffer copy = new StringBuffer(string2); 
     //iterate over string1 
     int n=string1.length(); 
     int m=string2.length(); 
     for (int i = 0; i < n; i++) { 
      final char ch = string1.charAt(i); 
      //set boolean for quick loop exit if found 
      boolean foundIt = false; 
      //compare char with range of characters to either side 

      for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) { 
       //check if found 
       if (copy.charAt(j) == ch) { 
        foundIt = true; 
        //append character found 
        returnCommons.append(ch); 
        //alter copied string2 for processing 
        copy.setCharAt(j, (char)0); 
       } 
      } 
     } 
     return returnCommons; 
    } 
}

我提到的是，在整個過程中，我使腳本的只是實例，所以只有一次

jaro= new Jaro();

如果你要測試和需要的例子所以不破壞腳本，你會發現它here，在另一個蟒蛇優化線程

來源

2010-05-17 Pentium10

是的，但你不會享受它。用構造函數中分配的字符數組替換所有那些ed StringBuffers，並且不再使用整數索引來跟蹤其中的內容。

This pending Commons-Lang patch會給你一些味道。

來源

2010-05-17 12:36:52 bmargulies

我對此持懷疑態度，但我跑了一些測試，看起來char數組的真正大約是StringBuffer的十倍。如果你想避免使用實際的字符數組，StringBuilder的速度只有char數組的兩倍左右。 – Rubys 2010-05-17 22:02:09

你可以發佈該代碼片段來幫助指導。 – Pentium10 2010-05-18 07:30:57

嘗試避免getCommonCharacters循環中的兩個嵌套循環。
有關如何的建議：將所有字符存儲在某種映射的較小字符串中（java有幾個），其中鍵是字符，值是位置，這樣您仍然可以計算距離，他們是否有共同點。我不太瞭解算法，但我認爲這是可行的。
除了那個和bmargulies的答案，我真的沒有看到進一步的優化，除了比特等東西之外。如果這真的很關鍵，考慮用C重寫這個部分？

來源

2010-05-17 21:49:30 Rubys

我知道這個問題可能已經解決了一段時間，但我想對算法本身發表評論。當比較一個字符串與它自己時，答案結果是1/| string |關閉。當比較稍微不同的值時，這些值也變小。

解決方法是在getCommonCharacters方法的內部for-statement中將'm-1'調整爲'm'。該代碼然後像魅力:)

請參閱http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance以及一些示例。

來源

2010-10-27 19:05:37 mvantol1

我不太瞭解Android以及它如何與數據庫一起工作。 WP7有（將有:)）SQL CE。下一步通常是處理您的數據。添加字符串長度並限制您的比較。在兩列上添加索引，然後按值分類。長度上的索引也應該排序。我使用了15萬條醫療術語在舊服務器上運行，在0.5秒內給我提供了建議和拼寫檢查，用戶幾乎無法注意到它，特別是在單獨的線程中運行時。

因爲有需要，我打算長時間（比如2年:)）寫博客。但我最終設法寫下幾句話並提供一些提示。請看看這裏：

ISolvable.blogspot.com

雖然它是微軟的平臺，還是一般的原則是相同的。

來源

2011-05-31 04:56:50 Ivan

是的，這可以做得更快。首先，你根本不需要StringBuffers。另一方面，您不需要單獨的循環來計算換位。

你可以找到my implementation here，它應該快很多。它在Apache 2.0許可下。

來源

2011-09-15 07:05:04 larsga

使用GetCommonCharacters方法，而不是返回常見的字符，請使用一對數組，以保持匹配，C版同樣在這裏https://github.com/miguelvps/c/blob/master/jarowinkler.c

/*Calculate matching characters*/ 
for (i = 0; i < al; i++) { 
    for (j = max(i - range, 0), l = min(i + range + 1, sl); j < l; j++) { 
     if (a[i] == s[j] && !sflags[j]) { 
      sflags[j] = 1; 
      aflags[i] = 1; 
      m++; 
      break; 
     } 
    } 
}

另一種優化方法是預先計算每串位掩碼。使用它，檢查第一個字符串上的當前字符是否出現在第二個字符串上。這可以使用有效的按位操作來完成。

這將跳過計算最大/最小和循環的缺失字符。

來源

2017-03-03 21:15:12 dvidben

優化Jaro-Winkler算法

回答

相關問題