Java: data structure for storing a large number of words

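I have to store a large number of words (200k+) in a Java program, and I want fast access to them. I only need to know whether a given word belongs to my "dictionary". I don't need a pair like <word, smthg>. If possible, I'm looking for a solution in the standard library.

PS: Maybe using a data structure is not the best way to do this? Would it be more efficient to read the file containing the words each time?

Edit: This is a small project. I have to deal with both efficiency and memory.

Final edit: I eventually chose HashSet.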
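For reference, here is a minimal sketch of what the HashSet approach could look like. The class name `WordDictionary`, the `fromFile` helper, and the one-word-per-line file layout are assumptions made for illustration; they are not from the original post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: wraps a HashSet and only answers membership queries.
class WordDictionary {
    private final Set<String> words;

    WordDictionary(Collection<String> source) {
        // Presize the set so the words fit without rehashing (default load factor is 0.75).
        this.words = new HashSet<>((int) (source.size() / 0.75f) + 1);
        this.words.addAll(source); // duplicates are silently dropped
    }

    // Assumes a UTF-8 text file with exactly one word per line.
    static WordDictionary fromFile(Path file) throws IOException {
        return new WordDictionary(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Average-case constant-time lookup; no ordering is required for this.
    boolean contains(String word) {
        return words.contains(word);
    }

    int size() {
        return words.size();
    }
}
```

The file is read once at startup; after that, something like `dict.contains("help")` is a hash lookup and does not touch the disk again.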

Source

2013-04-18 DouglasAdams

Sounds like [HashSet](http://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html) might be a very good fit. – Keppil

Have you considered using [Lucene](http://lucene.apache.org/)? – SenthilPrabhu

@Keppil The problem with HashSet is that it is not sorted, so searching would be slower. –

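Use a Java Set. A sorted set such as TreeSet keeps its elements in order, so lookups can use binary-search-style techniques; they are fast, and a set holds no duplicates.

This is the structure of the Java collections:

[image: diagram of the Java collections hierarchy]

A set also does not allow duplicates, which reduces redundancy and saves memory.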
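A small sketch of the TreeSet variant described in this answer, assuming only membership checks and sorted iteration are needed. The class name `SortedWords` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical wrapper around a TreeSet (a red-black tree in the JDK).
class SortedWords {
    private final TreeSet<String> words = new TreeSet<>();

    // Returns false if the word was already present, so duplicates are never stored.
    boolean add(String word) {
        return words.add(word);
    }

    // O(log n) lookup: the tree is walked by comparing strings.
    boolean contains(String word) {
        return words.contains(word);
    }

    // Iteration order is the natural (lexicographic) order of the strings.
    List<String> inOrder() {
        return new ArrayList<>(words);
    }
}
```

Compared with a hash-based set, this trades average constant-time lookups for O(log n) lookups, and in return the words come back in sorted order.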

If you want to know the complexity of the various search algorithms, see this link:

http://bigocheatsheet.com/

Source

2013-04-18 10:21:51

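A set will waste a lot of memory. There are specialized data structures for this kind of task. –

@IvayloStrandjev 200k words averaging 10 characters each, stored in a HashSet, would probably take 5 to 10 MB of memory. That's not a lot... – assylias

Just tried it; it's closer to 20 MB, but that's still not much. – assylias

Depending on the distribution of the words, use a Trie or a Patricia tree. Personally I would go for the Patricia tree, since it is better for memory usage (although it is harder to implement).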
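To illustrate the idea, here is a minimal trie sketch for membership and prefix checks. It is a plain trie with one HashMap per node, not a Patricia tree, so it shows how the lookup works rather than the memory savings; the class name `SimpleTrie` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie: each node maps a character to a child node.
class SimpleTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    // Walks down the tree, creating missing nodes, and marks the last node as a word end.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.isWord = true;
    }

    // A word is present only if its path exists and ends on a word-end node.
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    // True if at least one stored word starts with the given prefix.
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follows the characters of s from the root; returns null if the path breaks off.
    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }
}
```

Lookup cost depends on the length of the word rather than on the number of stored words. A Patricia tree would additionally merge chains of single-child nodes into one edge, which is where the memory benefit comes from.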

Source

2013-04-18 10:20:47

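For a fairly small number of objects, as in the OP's use case, a HashSet will do just fine. It is also worth noting that there is no Trie/Patricia Tree implementation in the standard JDK. – assylias

Maybe you want to test my TrieMap or TrieSet implementation (found here)? I wrote them specifically for cases like this. So far I have implemented Tries for String and byte[] keys.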

TrieSet<String> t = Tries.newStringTrieSet();

t.add("hello");
t.add("help");
t.add("hell");
t.add("helmet");
t.add("hemp");

List<String> resultsA = new ArrayList<>();
t.findElements("hel", true, resultsA); // search for prefix

List<String> resultsB = new ArrayList<>();
t.findElements("ell", false, resultsB); // search for substring

System.out.println("A: " + resultsA);
System.out.println("B: " + resultsB);

This will print:

A: [hell, hello, helmet, help] 
B: [hell, hello]

Source

2013-04-18 11:36:02

> 1.5 KLOC, and not a single test? –

This looks fine to me, though I may be wrong for some reason:

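import java.util.*;

class SortedWordList {
    private final List<String> arr = new ArrayList<>();

    // Put all your words into an ArrayList and sort the list once.
    // The Scanner is an assumed word source, e.g. one wrapping the word file.
    SortedWordList(Scanner in) {
        while (in.hasNext()) {
            arr.add(in.next());
        }
        Collections.sort(arr);
    }

    // This is your search method.
    boolean mysearch(String keyword) {
        // binarySearch returns a non-negative index only when the keyword is found.
        return Collections.binarySearch(arr, keyword) >= 0;
    }
}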

The performance is O(n log n) for loading and sorting the data, and O(log n) for each search.

Assuming each string is 20 B on average, 20 B * 200000 = 4 MB of space.

Source

2013-04-18 11:41:26 smttsp
