2013-07-26 46 views
5

How do I compute a match score between two strings in Java? I want to classify two strings as similar or not similar. For example:

s1 = "Token is invalid. DeviceId = deviceId: "345" " 
s2 = "Token is invalid. DeviceId = deviceId: "123" " 
s3 = "Could not send Message." 

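I am looking for a Java library that gives a match score between two strings, so that I can decide from the score whether they are similar. My program only needs to handle a small data set (~2000 strings). Do you know of anything available?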

matching

Answers

0

As suggested, here is the Levenshtein distance algorithm:

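public class LevenshteinDistance
{
    private static int minimum(int a, int b, int c)
    {
        return Math.min(Math.min(a, b), c);
    }

    // Returns the minimum number of single-character insertions, deletions
    // and substitutions needed to turn str1 into str2.
    public static int computeLevenshteinDistance(CharSequence str1, CharSequence str2)
    {
        // distance[i][j] = edit distance between the first i characters of str1
        // and the first j characters of str2.
        int[][] distance = new int[str1.length() + 1][str2.length() + 1];

        // Turning a prefix into the empty string takes one deletion per character.
        for (int i = 0; i <= str1.length(); i++)
        {
            distance[i][0] = i;
        }
        // Building a prefix from the empty string takes one insertion per character.
        for (int j = 1; j <= str2.length(); j++)
        {
            distance[0][j] = j;
        }

        for (int i = 1; i <= str1.length(); i++)
        {
            for (int j = 1; j <= str2.length(); j++)
            {
                // A substitution costs nothing when the two characters already match.
                int substitutionCost = (str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0 : 1;
                distance[i][j] = minimum(distance[i - 1][j] + 1,                     // deletion
                                         distance[i][j - 1] + 1,                     // insertion
                                         distance[i - 1][j - 1] + substitutionCost); // substitution
            }
        }

        return distance[str1.length()][str2.length()];
    }

    public static void main(String[] args)
    {
        String s1 = "Token is invalid. DeviceId = deviceId: \"345\" ";
        String s2 = "Token is invalid. DeviceId = deviceId: \"123\" ";
        String s3 = "Could not send Message.";

        System.out.println(computeLevenshteinDistance(s1, s2)); // s1 vs. s2
        System.out.println(computeLevenshteinDistance(s1, s3)); // s1 vs. s3
        System.out.println(computeLevenshteinDistance(s2, s3)); // s2 vs. s3
    }
}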
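
The raw distance depends on string length, so to classify pairs as similar or not it may help to normalize it into a score between 0 and 1 and compare that against a cutoff. The following is only a sketch under assumptions: the class name `SimilarityScore`, the helper names, and the 0.8 threshold are all hypothetical and not part of the answer above, and the distance is computed with a two-row variant of the same algorithm so that the example stands on its own.

```java
final class SimilarityScore {

    // Levenshtein distance keeping only two rows of the table in memory.
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Score in [0, 1]: 1.0 for identical strings, 0.0 for maximally different ones.
    static double similarity(CharSequence a, CharSequence b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // The threshold is an assumption to tune on your own data.
    static boolean isSimilar(CharSequence a, CharSequence b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        String s1 = "Token is invalid. DeviceId = deviceId: \"345\" ";
        String s2 = "Token is invalid. DeviceId = deviceId: \"123\" ";
        String s3 = "Could not send Message.";

        double threshold = 0.8; // hypothetical cutoff
        System.out.println(isSimilar(s1, s2, threshold)); // s1 vs. s2
        System.out.println(isSimilar(s1, s3, threshold)); // s1 vs. s3
    }
}
```

Dividing by the longer length keeps the score comparable across pairs of different sizes. With about 2000 strings, comparing every pair this way is roughly two million distance computations, which should be manageable for short messages like these.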
1

For any NLP-related Java question, you should look at the Apache Lucene project. For your needs, however, a simple Levenshtein distance algorithm is more than enough.
