搜索最長的前綴

我們假設它有一串字符串D。給定字符串Q，我想查找D中具有最長公用前綴Q的字符串。搜索最長的前綴

我不想要一個複雜的數據結構，但它仍然應該比單純的線性掃描更快。

是否有解決方案，以巧妙的方式排序D，只做一個二進制搜索？

謝謝！

編輯

澄清：當然，如果只進行一次，單次掃描是比排序更快。然而，我需要在固定的D上做很多這樣的查找，所以這就是我尋找預計算數據結構的原因。

來源

2017-03-15 user3612643

對我來說沒有意義，平均來說甚至是最快的排序* O（n log n）*比線性搜索* O（n）*慢。 – Pavlo

它必須做很多次，所以這就是爲什麼我要尋找預先計算的數據結構。 – user3612643

_「有沒有解決方案」_你有什麼嘗試？完成其他人的功課並不是很有趣...... – evolutionxbox

我在Java中一起實現了一個實現（因爲我不知道如何打印或javascript）。這個方法雖然可以翻譯，所以我希望這可能會有所幫助。

這是我的思維過程：

d是恆定的，所以我們要找到一種方法，找到具有共同的前綴的所有單詞。因此，爲此我執行：

一種樹狀結構，它根據字符的字符串對字符串進行索引。意字符串artur將被存儲在a - >r - >t - >u等
這使索引d在時間O（n），其中n是串的長度的複雜性。
這使得搜索詞共享一個共同的前綴爲O（n），其中n是我們正在尋找的

該方法的前綴長度有一些限制，這樣我可以更快的測試： *只允許小寫字母 *將字符串存儲在中間以避免在查找前綴時遍歷樹。

所以，我的代碼，我有這些測試，還增加了一些時間，看看會發生什麼：

public class CommonPrefixTree { 

    public static void main(String[] args) { 
     Node treeRoot = new Node(); 

     index("Artur", treeRoot); 
     index("ArturTestMe", treeRoot); 
     index("Blop", treeRoot); 
     index("Muha", treeRoot); 
     index("ArtIsCool", treeRoot); 

     List<String> strings = new ArrayList<>(); 

     char[] chars = "abcdefghijklmnopqrstuvwxyz".toCharArray(); 
     Random r = new Random(); 
     for(int i = 0; i < 500000; i++) { 
      StringBuffer b = new StringBuffer(); 
      for(int j = 0; j < 20 ; j++) { 
       b.append(chars[r.nextInt(chars.length)]); 
      } 
      strings.add(b.toString()); 
      index(b.toString(), treeRoot); 
     } 

     strings.add("art"); 
     strings.add("a"); 
     strings.add("artu"); 
     strings.add("arturt"); 
     strings.add("b"); 

     System.out.println(" ----- Tree search -----"); 
     find("art", treeRoot); 
     find("a", treeRoot); 
     find("artu", treeRoot); 
     find("arturT", treeRoot); 
     find("b", treeRoot); 

     // The analog test for searching in a list 

     System.out.println(" ----- List search -----"); 
     findInList("art", strings); 
     findInList("a", strings); 
     findInList("artu", strings); 
     findInList("arturt", strings); 
     findInList("b", strings); 

    } 

    static class Node { 

     Node[] choices = new Node[26]; 
     Set<String> words = new HashSet(); 

     void add(String word) { 
      words.add(word); 
     } 

     boolean contains(String word) { 
      return words.contains(word); 
     } 

    } 

    static List<String> findInList(String prefix, List<String> options) { 
     List<String> res = new ArrayList<>(); 
     long start = System.currentTimeMillis(); 
     for(String s : options) { 
      if(s.startsWith(prefix)) res.add(s); 
     } 

     System.out.println("Search took: " + (System.currentTimeMillis() - start)); 
     return res; 
    } 

    static void index(final String toIndex, final Node root) { 
     Node tmp = root; 
     // indexing takes O(n) 
     for(char c : toIndex.toLowerCase().toCharArray()) { 
      int val = (int) (c - 'a'); 
      tmp.add(toIndex); 
      if(tmp.choices[val] == null) { 
       tmp.choices[val] = new Node(); 
       tmp = tmp.choices[val]; 
      } else { 
       tmp = tmp.choices[val]; 
       if(tmp.contains(toIndex)) return; // stop, we have seen the word before 
      } 
     } 
    } 

    static Set<String> find(String prefix, final Node root) { 

     long start = System.currentTimeMillis(); 

     Node tmp = root; 
     // step down the tree to all common prefixes, O(n) where prefix defines n 
     for(char c : prefix.toLowerCase().toCharArray()) { 
      int val = (int) (c - 'a'); 
      if(tmp.choices[val] == null) { 
       return Collections.emptySet(); 
      } 
      else tmp = tmp.choices[val]; 
     } 

     System.out.println("Search took: " + (System.currentTimeMillis() - start)); 
     return tmp.words; 
    } 
}

結果樹和原始列表搜索

那麼這將導致這些對於5個搜索100，10000和500k的字符串的定時：

----- Tree search ----- 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0 
----- List search ----- 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0

----- Tree search ----- 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0 
----- List search ----- 
Search took: 2 
Search took: 2 
Search took: 2 
Search took: 2 
Search took: 2

----- Tree search ----- 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0 
Search took: 0 
----- List search ----- 
Search took: 43 
Search took: 27 
Search took: 66 
Search took: 25 
Search took: 24

的主要問題有，這是創建樹（這可能只是我的哈克實施一棵樹或我浪費內存做它的方式）。所以還有改進的空間。樹的創建確實需要很長時間。

該實驗表明，對於這個常見前綴的查找使用樹的時間消耗是穩定的。

觀光雖然考慮可能是：

用於數據結構稀疏數組。
不存儲實際的字符串，而是遍歷樹找到所有常用前綴

希望幫助 - 好玩的小運動。讓我知道如果我塞進它完全:)

上排序輸入的二進制搜索

我也注意到你問一個不復雜的數據結構，所以我嘗試了以下內容：

對字符串的輸入列表進行排序
二進制搜索匹配我們尋找的前綴的第一個索引
收集左右前綴

這將導致該代碼（再次，抱歉，這是Java，但它應該是翻譯很容易:)

static Set<String> getCommonPrefix(final String prefix, final List<String> input) { 

     long start = System.currentTimeMillis(); 

     int index = Collections.binarySearch(input, prefix, new Comparator<String>() { 
      @Override 
      public int compare(String o1, String o2) { 
       // o2 being the prefix 
       if(o1.startsWith(o2)) return 0; 
       return o1.compareTo(o2); 
      } 
     }); 

     if(index < 0) { 
      return Collections.emptySet(); 
     } 

     Set<String> res = new HashSet<>(); 
     res.add(input.get(index)); 

     boolean keepSearching = true; 
     int tmp = index - 1; 
     while(keepSearching && tmp > 0) { 
      if(input.get(tmp).startsWith(prefix)) { 
       res.add(input.get(tmp)); 
      } else { 
       keepSearching = false; 
      } 
      tmp--; 
     } 

     keepSearching = true; 
     tmp = index + 1; 
     while(keepSearching && tmp < input.size()) { 
      if(input.get(tmp).startsWith(prefix)) { 
       res.add(input.get(tmp)); 
      } else { 
       keepSearching = false; 
      } 
      tmp++; 
     } 

     System.out.println("Search took: " + (System.currentTimeMillis() - start)); 

     return res; 
    }

這其中有一個有趣的現象。搜索將採取O(log n)，其中n是數組的輸入大小。然後集合是線性的k其中k是通用前綴的數量。

有趣的是，只要前綴相當大，這種方法非常快（與樹實現相比），但是一旦你尋找非常少的前綴，這會變得有點慢，要檢索的字符串相當大。詳細的時序爲（500萬個隨機字符串）：

Search for 'art' took: 1 
Found strings: 309 
Search for 'artur2' took: 0 
Found strings: 1 
Search for 'asd' took: 0 
Found strings: 265 
Search for 'nnb' took: 1 
Found strings: 276 
Search for 'asda' took: 0 
Found strings: 10 
Search for 'c' took: 63 
Found strings: 192331

我想，從一個Java的腳本點，如果你有一個內置的二進制搜索，最後的方法可能是最簡單，最直接的選擇就是構建和維護一棵樹，這對我來說有點更重要（對我來說）花了很多時間來對字符串進行索引。

來源

2017-03-15 16:24:41 pandaadb

這看起來像Java，但問題被標記爲'javascript'和'typescript'。 – Pavlo

的確如此。這個想法可以很容易地翻譯。 – pandaadb

在D創建基於字符的樹：

每個node包含character和兒童node s的名單。

例如，如果D是

a 
ab 
ac 
ace 
d

然後

有2個頂級節點a和d
d一直沒有孩子
a有2個孩子 - b和c
b沒有孩子
c有1個小孩 - （！並添加到樹）e
e有沒有孩子

查找基本上走的節點，直到沒有匹配的兒童。

例如，假設Q=af。有一個包含Q[0]=a的頂級節點，但它沒有Q[1]=f的子節點，所以最長的前綴是a。 a節點的所有子節點代表D中的字符串，其最長公共前綴爲Q，具體爲a,ab,ac,ace。

查找和添加操作在字符串長度中都是線性的，因此創建結構需要O(sum(len(x) for x in D))時間並且查找是O(len(Q))。

來源

2017-03-15 16:06:10 sds

如何從樹中獲取原始字符串？ – Pavlo

@Pavlo：什麼「原始字符串」？ – sds

正如問題所述：'我想找到D中具有最長共同前綴Q的字符串'。所以我想象一下找到這個前綴本身很容易，但是整個字符串呢？ – Pavlo

搜索最長的前綴

回答

相關問題