用於搜索字符串的更快數據結構

我有這個代碼確定一個單詞（忽略大小寫）是否包含在wordList文本文件中。但是，wordList文本文件可能有65000 ++行，而使用下面的實現只搜索一個詞需要近一分鐘。你能想到更好的實現嗎？用於搜索字符串的更快數據結構

謝謝！

import java.io.*; 
import java.util.*; 

public class WordSearch 
{ 
    LinkedList<String> lxx; 
    FileReader fxx; 
    BufferedReader bxx; 

    public WordSearch(String wordlist) 
     throws IOException 
    { 
     fxx = new FileReader(wordlist); 
     bxx = new BufferedReader(fxx); 
     lxx = new LinkedList<String>(); 
     String word; 

     while ((word = bxx.readLine()) != null) 
      { 
      lxx.add(word); 
     } 

     bxx.close(); 
    } 

    public boolean inTheList (String theWord) 
    { 
     for(int i =0 ; i < lxx.size(); i++) 
      { 
      if (theWord.compareToIgnoreCase(lxx.get(i)) == 0) 
        { 
       return true; 
      } 
     } 

     return false; 
    } 
}

來源

2011-03-05 Mariska

所有編輯器（包括SO的magic textarea）中的空格端口的縮進比製表符要好。 –

有多少個不同的單詞？ –

我們在哪裏可以得到一長串單詞？我設法模擬15k，我正在運行一個ms – OscarRyz

使用一個HashSet，在其中放入每個單詞的小寫版本。檢查一個HashSet是否包含一個指定的字符串，平均來說是一個恆定時間（讀取：非常快）的操作。

來源

2011-03-05 02:39:39

如何獲取搜索結果的索引？ –

@ahmedghanayem：「索引」和「搜索結果」是什麼意思？這個問題是關於尋找一個確切的（但不區分大小寫）單個搜索詞的匹配;這與完整的搜索引擎不同，後者可能會返回一組與搜索詞匹配的文檔以不同程度。 –

由於您正在搜索，您可能需要考慮在搜索前對列表進行排序;那麼你可以做二分搜索，這比線性搜索快得多。如果您在同一個列表中執行多個搜索，這可能會有所幫助，否則您爲列表排序付出的代價並不值得只搜索一次。

另外，使用「lxx.get（i）」在鏈表上進行線性搜索會造成麻煩。 LinkedList.get（）是O（n）。您可以使用Iterator（簡單的方法：for（String s：lxx））或切換到具有O（1）訪問時間的列表類型，如ArrayList。

來源

2011-03-05 02:38:52 vanza

每次在O（n）操作中通過l進行搜索，因此當您有數千個單詞時，這將花費相當高昂。相反，使用HashSet：

Set<String> lxx; 

... 

lxx = new HashSet<String>(); 
while ((word = bxx.readLine()) != null) { 
     lxx.add(word.toLowerCase()); 
} 
bxx.close();

然後用lxx.contains(theWord.toLowerCase())檢查，如果字是在文件中。 HashSet中的每個查找都是O（1）操作，因此所花費的時間幾乎與文件大小無關。

來源

2011-03-05 02:42:51

首先，不聲明你的變量是一個LinkedList，宣稱它是一個List（部分代碼並不關心刪除列表：

public class WordSearch 
{ 
    List<String> lxx; 

    public WordSearch(String wordlist) 
     throws IOException 
    { 
     lxx = new LinkedList<String>(); 
    } 
}

接下來不叫得到列表中，使用一個LinkedList得到的將是非常緩慢的而不是使用迭代器......更好的使用循環的新的S型，其採用了迭代器你：

public boolean inTheList (String theWord) 
    { 
     for(String word : lxx) 
     { 
      if (theWord.compareToIgnoreCase(word) == 0) 
      { 
       return true; 
      } 
     } 

     return false; 
    }

下一頁改變新的LinkedList到一個新的ArrayList ：

lxx = new ArrayList（）;

此代碼應該更快，但您仍然可以做得更好。

由於您不關心重複的單詞，請使用Set而不是List，並使用HashSet而不是ArrayList。

這樣做會顯着提高程序的速度。

您的原始代碼，使用帶有LinkedList的LinkedList必須從列表的開始處重新開始，每次搜索列表中的下一個單詞時。使用迭代器（通過每個循環的新樣式）可以避免發生這種情況。

使用LinkedList意味着每次必須轉到列表中的下一個單詞時都會涉及查找，ArrayList沒有這種開銷。

使用HashSet使用具有非常快的查找的樹結構（可能）。

來源

2011-03-05 03:09:58 TofuBeer

這是我的實現在50毫秒以下搜索。

首先，您必須加載文件並將其保存在內存中。

不管你想不想加載它，但是如果你將它加載到大塊中會更容易。

我的輸入是在byte into python book（下載HTML單文件版本）和Java language specification（下載HTML和創建一個單獨的文件了所有的HTML頁面）

創建列表進入我用一個大文件這個相同的程序（參見注釋代碼）。

一旦我有30萬左右的話大的文件，我跑了此輸出方案：

C:\Users\oreyes\langs\java\search>dir singlelineInput.txt 
El volumen de la unidad C no tiene etiqueta. 
El número de serie del volumen es: 22A8-203B 

Directorio de C:\Users\oreyes\langs\java\search 

04/03/2011 09:37 p.m.   3,898,345 singlelineInput.txt 
       1 archivos  3,898,345 bytes 

C:\Users\oreyes\langs\java\search>javac WordSearch.java 

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great" 
Loaded 377381 words in 2844 ms 
true 
in 31 ms 

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great" 
Loaded 377381 words in 2812 ms 
true 
in 31 ms 

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "awesome" 
Loaded 377381 words in 2813 ms 
false 
in 47 ms 

C:\Users\oreyes\langs\java\search>gvim singlelineInput.txt 

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "during" 
Loaded 377381 words in 2813 ms 
true 
in 15 ms 

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "specification" 
Loaded 377381 words in 2875 ms 
true 
in 47 ms 

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<href" 
Loaded 377381 words in 2844 ms 
false 
in 47 ms 

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<br>" 
Loaded 377381 words in 2829 ms 
true 
in 15 ms

始終低於50毫秒。

下面的代碼：

import java.io.*; 
    import java.util.*; 

    class WordSearch { 
     String inputFile; 
     List<String> words; 
     public WordSearch(String file) { 
      inputFile = file; 
     } 
     public void initialize() throws IOException { 
      long start = System.currentTimeMillis(); 
      File file = new File(inputFile); 
      ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length()); 
      FileInputStream in = new FileInputStream(file); 
      copyLarge(in, baos, (int)file.length()); 

      Scanner scanner = new Scanner(new ByteArrayInputStream( baos.toByteArray())); 
      words = new LinkedList<String>(); 
      while(scanner.hasNextLine()) { 
       String l = scanner.nextLine().trim(); 
       //for(String s : l.split("\\s+")){ 
       //System.out.println(s); 
       words.add(l.toLowerCase()); 
       //} 
      } 

      Collections.sort(words); 
      for(String s : words) { 
       //System.out.println(s); 
      } 
      System.out.println("Loaded " + words.size() + " words in "+ (System.currentTimeMillis() - start) + " ms" ); 
     } 

     public boolean contains(String aWord) { 
      return words.contains(aWord.toLowerCase()); 
     } 
     // taken from: http://stackoverflow.com/questions/326390/how-to-create-a-java-string-from-the-contents-of-a-file/326413#326413 
     public static long copyLarge(InputStream input, OutputStream output, int size) 
       throws IOException { 
      byte[] buffer = new byte[size];// something biggie 
      long count = 0; 
      int n = 0; 
      while (-1 != (n = input.read(buffer))) { 
       output.write(buffer, 0, n); 
       count += n; 
      } 
      return count; 
     } 
     public static void main(String ... args) throws IOException { 
      WordSearch ws = new WordSearch(args[0]); 
      ws.initialize(); 
      long start = System.currentTimeMillis(); 
      System.out.println(ws.contains(args[1])); 
      System.out.println("in "+ (System.currentTimeMillis() - start) +" ms "); 

     } 
    }

難的是得到一個樣本輸入：P

來源

2011-03-05 03:46:45 OscarRyz

你猜怎麼着，使用在任何時間一個HashMap的回報：

下面是修改後的版本，它始終在0毫秒內完成。

import java.io.*; 
    import java.util.*; 

    class WordSearch { 
     String inputFile; 
     //List<String> words; 
     Set<String> words; 
     public WordSearch(String file) { 
      inputFile = file; 
     } 
     public void initialize() throws IOException { 
      long start = System.currentTimeMillis(); 
      File file = new File(inputFile); 
      ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length()); 
      FileInputStream in = new FileInputStream(file); 
      copyLarge(in, baos, (int)file.length()); 

      Scanner scanner = new Scanner(new ByteArrayInputStream( baos.toByteArray())); 
      words = new HashSet<String>(); 
      while(scanner.hasNextLine()) { 
       String l = scanner.nextLine().trim(); 
       //for(String s : l.split("\\s+")){ 
       //System.out.println(s); 
       words.add(l.toLowerCase()); 
       //} 
      } 

      //Collections.sort(words); 
      for(String s : words) { 
       System.out.println(s); 
      } 
      System.out.println("Loaded " + words.size() + " words in "+ (System.currentTimeMillis() - start) + " ms" ); 
     } 

     public boolean contains(String aWord) { 
      return words.contains(aWord.toLowerCase()); 
     } 

     public static long copyLarge(InputStream input, OutputStream output, int size) 
       throws IOException { 
      byte[] buffer = new byte[size];// something biggie 
      long count = 0; 
      int n = 0; 
      while (-1 != (n = input.read(buffer))) { 
       output.write(buffer, 0, n); 
       count += n; 
      } 
      return count; 
     } 
     public static void main(String ... args) throws IOException { 
      WordSearch ws = new WordSearch(args[0]); 
      ws.initialize(); 
      long start = System.currentTimeMillis(); 
      System.out.println(ws.contains(args[1])); 
      System.out.println("in "+ (System.currentTimeMillis() - start) +" ms "); 

     } 
    }

現在我知道肯定:)

來源

2011-03-05 03:51:01 OscarRyz

兩個建議：兩個數據結構給你一個更好的性能。

向無環詞圖（DAWG）
字典的數據結構。 n-tree

來源

2011-09-28 12:28:12

用於搜索字符串的更快數據結構

回答

相關問題