2012-11-02 39 views
0

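Binary search for a prefix string

I have a test harness, and for some reason this code fails to find prefixes and also misses short words. Any suggestions/hints/ideas? The words are in a dict.txt file.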

def search(str): 
    """Search for a prefix string in the dictionary. 
    Args: 
     str: A string to look for in the dictionary 
    Returns: 
     code WORD if str exactly matches a word in the dictionary, 
      PREFIX if str does not match a word exactly but is a prefix 
       of a word in the dictionary, or 
     NO_MATCH if str is not a prefix of any word in the dictionary 
    """ 

    left = 0 
    right = len(dict) - 1 
    mid = (left + right) // 2 
    elem = dict[mid] 
    while right >= left: 
     if elem == str: 
      return WORD 
     elif elem < str: 
      left = mid + 1 
      mid = (left + right) // 2 
     elif elem > str: 
      right = mid - 1 
      mid = (left + right) // 2 
     elif elem == str[0:len(elem)]: 
      return PREFIX 
     elem = dict[mid] 
     #print(left, right, mid) 

    return NO_MATCH 
+1

I think a prefix will always compare less than the full string, so the last elif will never be hit – noisecapella

+1

dict() is a built-in, so you should avoid naming variables in a way that shadows built-in functions. The same goes for str() as well.. – monkut

Answers

0

Not quite sure what you need, but a sliding window is useful for this kind of search.

from itertools import islice 


def window(seq, n=2): 
    "Returns a sliding window (of width n) over data from the iterable" 
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...     " 
    it = iter(seq) 
    result = tuple(islice(it, n)) 
    if len(result) == n: 
     yield result  
    for elem in it: 
     result = result[1:] + (elem,) 
     yield result 


WORD = "Found word!" 
PREFIX = "Found prefix!" 
NO_MATCH = "No match found!"  

def search(search_for, search_in): 
    assert len(search_for) < len(search_in) 

    window_size = len(search_for) 
    total_length = len(search_in) 
    search_window = window(search_in, len(search_for)) 

    for idx, search_group in enumerate(search_window, window_size): 
     joined_str = "".join(search_group) 
     if joined_str == search_for: 
      # found match, determine if there is any left 
      if idx < total_length: 
       return PREFIX 
      elif idx == total_length: 
       return WORD    
    return NO_MATCH 
2
left = 0 
right = len(dict) - 1 
while right >= left: 
    mid = (left + right) // 2  # compute once per iteration 
    elem = dict[mid]  # read the element only after mid is known 
    if elem == str: 
        return WORD 
    elif elem < str: 
        left = mid + 1 
    elif elem > str: 
        right = mid - 1 
    elif elem == str[0:len(elem)]:  # never reached: the branches above cover every case 
        return PREFIX 
    #print(left, right, mid) 

return NO_MATCH 

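You don't have to compute mid inside each of the ifs. But how is the dictionary structured? And how many characters must a prefix have? Give us some more information so we can help more.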
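If the goal is what the question's docstring describes (the search string is a prefix of some dictionary word), one possible sketch uses the standard library's bisect module instead of a hand-written loop. This assumes the word list is sorted by plain string comparison; the name search_sorted and its two-argument signature are hypothetical and not taken from the question.

```python
from bisect import bisect_left

WORD = "Found word!"
PREFIX = "Found prefix!"
NO_MATCH = "No match found!"


def search_sorted(words, query):
    """Classify query against a sorted list of words (hypothetical helper)."""
    # bisect_left returns the index of the first word >= query.
    i = bisect_left(words, query)
    if i < len(words):
        if words[i] == query:
            return WORD
        # Words that start with query sort directly after the position
        # where query itself would go, so checking words[i] is enough.
        if words[i].startswith(query):
            return PREFIX
    return NO_MATCH
```

For example, search_sorted(["apple", "banana"], "app") classifies "app" as a prefix, while "apple" is classified as a word. Passing the list in as an argument also avoids shadowing the built-in dict and str.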

0

Consider a dict.txt with the following contents:

a 
aa 
aaa 
aaaa 
aaaaa 

You search for the word "aac", and mid happens to pivot on aaa.

In a standard binary search, the search space then becomes:

aaaa 
aaaaa 

So neither aa nor a, both of which could be prefixes, will ever be found.
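The walk-through above can be reproduced with a small tracing sketch. The helper probed_words is hypothetical; it only records which words a plain binary search compares against the search string.

```python
def probed_words(words, query):
    """Return the words a plain binary search compares against query."""
    probes = []
    left, right = 0, len(words) - 1
    while left <= right:
        mid = (left + right) // 2
        probes.append(words[mid])
        if words[mid] == query:
            break
        if words[mid] < query:
            left = mid + 1
        else:
            right = mid - 1
    return probes


words = ["a", "aa", "aaa", "aaaa", "aaaaa"]
print(probed_words(words, "aac"))  # ['aaa', 'aaaa', 'aaaaa']
```

The search only ever moves right, so a and aa are never examined.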

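I think what you want is going to need a more sophisticated algorithm. If you intend to base it on binary search, I would probably start with the shortest prefix of str (a single character, in the worst case) and progressively lengthen it as it finds matches.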

While I think it would be most efficient to wrap it all in the same left-right-pivot loop, you could even do it simply like this:

def search_prefix(str): 
    longest_prefix = NO_MATCH 
    # Try prefixes of length 1..len(str); str[:0] would be the empty string. 
    for n in range(1, len(str) + 1): 
        prefix = search(str[:n]) 
        if prefix == NO_MATCH: 
            break 
        longest_prefix = prefix 
    return longest_prefix 

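Note: I said a single character in the worst case, but in practice you could pre-cache the word lengths in dict.txt (and hence the prefix lengths), like this: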

prefix_lengths = sorted(set(map(len, dict)))
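As a rough sketch of how the cached lengths might be used, the loop below only tries prefix lengths that actually occur in the dictionary. The helpers contains and longest_word_prefix are hypothetical names, and the word list is assumed to be sorted by plain string comparison.

```python
from bisect import bisect_left


def contains(words, candidate):
    """Exact-match lookup in a sorted list via binary search."""
    i = bisect_left(words, candidate)
    return i < len(words) and words[i] == candidate


def longest_word_prefix(words, query):
    """Return the longest word in sorted `words` that is a prefix of query, or None."""
    # Only lengths that occur in the dictionary can produce a match.
    prefix_lengths = sorted(set(map(len, words)))
    best = None
    for n in prefix_lengths:
        if n > len(query):
            break  # longer words cannot be prefixes of query
        if contains(words, query[:n]):
            best = query[:n]
    return best
```

Unlike a loop that stops at the first miss, this version keeps going after a failed length, since a dictionary containing only abc should still match the query abcd even though a and ab are absent.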