Data structure for the soundex algorithm?

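Can anyone suggest which data structure to use for a program based on the soundex algorithm? The language to be used is Java, so I would especially like to hear from anyone who has done this in Java before. The program should be able to read about 50,000 words, and it should be able to read a single word and return the related words that have the same soundex.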

I am not asking for an implementation of the program, just a few suggestions on which data structure to use.

2008-11-06 javac

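I believe you just need to convert each original string to its soundex key and store it in a hash table; the value of each entry in the table would be the collection of original strings that map to that soundex.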

The Multimap collection interface (and its implementations) in Google Collections would be useful to you.
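As a rough illustration of the hash-table approach described above, here is a minimal Java sketch that uses only the JDK. The class name `SoundexIndex`, its method names, and the simplified American Soundex routine are assumptions made for this example, not something taken from the answer; a plain `Map<String, Set<String>>` stands in for a Multimap so that no extra library is needed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class: maps each soundex code to the words that share it.
class SoundexIndex {
    // Digit for each letter A..Z; '0' marks letters that carry no code.
    private static final String CODES = "01230120022455012623010202";

    private final Map<String, Set<String>> buckets = new HashMap<>();

    // Adds a word to the bucket of its soundex code (words without letters are skipped).
    void add(String word) {
        String key = soundex(word);
        if (!key.isEmpty()) {
            buckets.computeIfAbsent(key, k -> new TreeSet<>()).add(word);
        }
    }

    // Returns all stored words that share the given word's soundex code.
    Set<String> lookup(String word) {
        return buckets.getOrDefault(soundex(word), Collections.emptySet());
    }

    // Simplified American Soundex: first letter plus up to three digits, zero-padded.
    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;          // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                         // keep the first letter as-is
            } else if (c == 'H' || c == 'W') {
                continue;                              // H and W do not separate equal codes
            } else if (code != '0' && code != prev) {
                out.append(code);
            }
            prev = code;
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');
        return out.toString();
    }
}
```

Loading 50,000 words is then one `add` call per word, and each query is a single `lookup`. With Google Collections on the classpath, the map of sets could be replaced by a Multimap implementation, which removes the need for `computeIfAbsent`.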

2008-11-06 23:34:30

Tip: if you use SQL as the data store, you can let SQL handle this with its two functions SOUNDEX and DIFFERENCE.

Maybe not what you were after, but many people don't know that MS SQL has these two functions.

2008-11-06 23:34:44 Stefan

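Soundex can be implemented directly on plain strings, so nothing special is needed for that part.

After that, the 4-character code can be treated as an integer key.

Then just build a dictionary that stores the sets of words indexed by that integer key. 50,000 words should easily fit in memory, so nothing fancy is needed.

Then walk the dictionary; each bucket is a group of similar-sounding words.

In fact, here is the whole program in Perl: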

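#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;

# Build the index: soundex code -> list of dictionary words with that code.
open(DICT, "</usr/share/dict/linux.words") or die "Cannot open dictionary: $!";
my %dictionary = ();
while (<DICT>) {
    chomp();
    chomp();
    push @{$dictionary{soundex($_)}}, $_;
}
close(DICT);

# For each word read from input, dump the bucket of similar-sounding words.
while (<>) {
    my @words = split / +/;
    foreach (@words) {
        print Dumper $dictionary{soundex($_)};
    }
}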

2008-11-06 23:35:13

Why the double chomp? (Is it to remove both the CR and the LF in DOS files?) – 2008-11-07 06:18:21

Yes; without it, it breaks on DOS files. – 2008-11-07 12:49:50

Since soundex is a hash, I would use a hash table with the soundex code as the key.

2008-11-06 23:35:18 warren

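import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SpellChecker
{

    /** Strategy that maps a word to its phonetic key (Soundex, Metaphone, ...). */
    interface Hash {
        String hash(String word);
    }

    private final Hash hash;

    // Phonetic key -> all words added so far that share that key.
    private final Map<String, Set<String>> collisions;

    SpellChecker(Hash hash) {
        this.hash = hash;
        collisions = new TreeMap<String, Set<String>>();
    }

    /** Returns true if the word was not already present. */
    boolean addWord(String word) {
        String key = hash.hash(word);
        Set<String> similar = collisions.get(key);
        if (similar == null)
            collisions.put(key, similar = new TreeSet<String>());
        return similar.add(word);
    }

    /** Returns a read-only view of the words sharing this word's key (possibly empty). */
    Set<String> similar(String word) {
        Set<String> similar = collisions.get(hash.hash(word));
        if (similar == null)
            return Collections.emptySet();
        else
            return Collections.unmodifiableSet(similar);
    }

}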

The hash strategy could be Soundex, Metaphone, or what have you. Some strategies may be tunable (how many characters to output, and so on).

2008-11-07 00:13:18 erickson

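You want a 4-byte integer.

The soundex algorithm always returns a 4-character code; with ANSI input, that code occupies 4 bytes (represented as 4 letters).

So store the returned codes in a hash table; to look up a word, convert it to its code and look that code up in the table. It really is that easy.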
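One way to read the "4-byte integer" suggestion is to pack the four ASCII characters of a code into a single Java `int`, one byte per character. The sketch below assumes that interpretation; the class name `SoundexKey` and the methods `pack` and `unpack` are hypothetical names chosen for this example.

```java
// Hypothetical helper: converts a 4-character ASCII soundex code to an int and back.
class SoundexKey {

    // Packs a code such as "R163" into one int, first character in the highest byte.
    static int pack(String code) {
        if (code.length() != 4) {
            throw new IllegalArgumentException("expected a 4-character code: " + code);
        }
        int key = 0;
        for (int i = 0; i < 4; i++) {
            char c = code.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("non-ASCII character in code: " + code);
            }
            key = (key << 8) | c;   // shift earlier characters up, append this one
        }
        return key;
    }

    // Reverses pack(): extracts the four bytes back into a 4-character string.
    static String unpack(int key) {
        char[] chars = new char[4];
        for (int i = 3; i >= 0; i--) {
            chars[i] = (char) (key & 0xFF);
            key >>>= 8;
        }
        return new String(chars);
    }
}
```

For example, `pack("R163")` yields `0x52313633`, which can serve as the key of a `Map<Integer, Set<String>>`. In Java a `String` key in a `HashMap` works just as well, so the integer form is probably only worth it if the memory footprint of the keys matters.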

2009-01-01 15:48:38 gbjbaanb
