2012-11-20 222 views
0

Suffix array and searching for a substring in a string

I found a suffix array implementation in Ruby and modified it slightly. This is what I have:

class SuffixArray
  def initialize(str)
    @string = str
    @suffix_array = []
    (0...str.length).each do |i|
      substring = @string[i...str.length]
      @suffix_array << {:suffix => substring, :index => i}
    end

    @sorted_suffix_array = @suffix_array.sort {|x,y| x[:suffix] <=> y[:suffix]}
  end

  def print_sorted
    @sorted_suffix_array.each {|item| puts "#{item[:index]}=>#{item[:suffix]}"}
    puts "total=>#{@sorted_suffix_array.size()}"
  end

  def print_unsorted
    @suffix_array.each {|item| puts "#{item[:index]}=>#{item[:suffix]}"}
    puts "total=>#{@suffix_array.size()}"
  end

  def find_substring(substring)
    low = 0
    high = @sorted_suffix_array.length
    while(low <= high) do
      mid = (low + high)/2
      comparison = @sorted_suffix_array[mid][:suffix]#[0..substring.length]
      if comparison > substring
        high = mid - 1
      elsif comparison < substring
        low = mid + 1
      else
        return @sorted_suffix_array[mid][:index]
      end
    end
  end

end

It works, but it doesn't find all the substrings I want. For example:

a = SuffixArray.new("there is a man who likes dogs") 
puts a.find_substring("man") #won't be found 
puts a.find_substring("who likes dogs") #will be found 
puts a.find_substring("o likes dogs") #will be found 

How can I change the algorithm so that it finds all the substrings I want?

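I know. That's why I'm asking this question.

You can maintain the LCP (longest common prefix) of the suffix array. If you search for a prefix of a suffix of the string in the suffix array, you should find the substring! – Arvind

How do I do that?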

Answers

1

Your code is almost correct. I made a few small changes, and now it works.

def find_substring(substring)
  low = 0
  # high must be a valid index, so start at the last element
  high = @sorted_suffix_array.length-1
  while(low <= high) do
    mid = (low + high)/2
    # compare only the first substring.length characters of the suffix
    comparison = @sorted_suffix_array[mid][:suffix][0...substring.length]
    if comparison > substring
      high = mid - 1
    elsif comparison < substring
      low = mid + 1
    else
      return @sorted_suffix_array[mid][:index]
    end
  end
end
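
The key change in the method above is that each suffix is truncated to the length of the query before it is compared, so a query matches whenever it is a prefix of some suffix. As a rough, self-contained sketch of that idea (the class name PrefixSearchSuffixArray is made up for this example and is not from the question or the answer):

```ruby
# Minimal sketch: a suffix array whose search compares only a prefix of each suffix.
class PrefixSearchSuffixArray
  def initialize(str)
    # Each entry keeps the suffix text and the position where it starts.
    @sorted = (0...str.length)
              .map { |i| { suffix: str[i..-1], index: i } }
              .sort_by { |entry| entry[:suffix] }
  end

  # Returns the start index of one occurrence of substring, or nil if absent.
  def find_substring(substring)
    low = 0
    high = @sorted.length - 1
    while low <= high
      mid = (low + high) / 2
      # Truncate the suffix to the query length before comparing.
      prefix = @sorted[mid][:suffix][0...substring.length]
      if prefix > substring
        high = mid - 1
      elsif prefix < substring
        low = mid + 1
      else
        return @sorted[mid][:index]
      end
    end
    nil
  end
end

sa = PrefixSearchSuffixArray.new("there is a man who likes dogs")
p sa.find_substring("man")            # 11
p sa.find_substring("who likes dogs") # 15
p sa.find_substring("cat")            # nil
```

With whole-suffix comparison, "man" is compared against "man who likes dogs" and never tests equal; with the truncation it does.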
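
I actually already had that. Why do you use @sorted_suffix_array.length-1 instead of @sorted_suffix_array.length?

When doing a binary search, low and high must both be valid indices. Check the code on the Wikipedia page for the binary search algorithm.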
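
To make the point about valid indices concrete, here is a small sketch on a plain sorted array of strings. The helper search_with_high is hypothetical and exists only so the two starting values for high can be tried side by side. In the question's code the same situation would show up as a NoMethodError, because [:suffix] ends up being called on nil.

```ruby
# Hypothetical helper: binary search over a sorted array with a caller-chosen
# initial upper bound, so both choices of high can be compared.
def search_with_high(sorted, target, high)
  low = 0
  while low <= high
    mid = (low + high) / 2
    item = sorted[mid]
    # sorted[sorted.length] is nil in Ruby; fail loudly instead of comparing nil.
    raise IndexError, "mid=#{mid} is out of bounds" if item.nil?
    if item > target
      high = mid - 1
    elsif item < target
      low = mid + 1
    else
      return mid
    end
  end
  nil
end

words = %w[ant bee cat]

# Starting at the last valid index: the search ends cleanly with "not found".
p search_with_high(words, "zebra", words.length - 1) # nil

# Starting one past the end: mid eventually reaches words.length.
begin
  search_with_high(words, "zebra", words.length)
rescue IndexError => e
  puts "with high = length: #{e.message}"
end
```

The failure only appears for queries that sort after every element, which is why the off-by-one can go unnoticed for a while.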

1

For others' reference, here is a version that does not keep the substrings in a hash.

Gist: https://gist.github.com/bluetwin/5268722

class SuffixArray

  attr_reader :suf, :string

  def initialize(string)
    @string = string
    # store only the start positions, sorted by the suffix beginning at each one
    @suf = (0..string.size-1).sort_by{|i|@string[i..-1]}
  end

  # suffix at rank idx of the sorted order (range end assumed to be -1, the end of the string)
  def substring(idx)
    @string[@suf[idx]..-1]
  end

  def bsearch(str)
    low = 0
    high = @suf.length-1
    found = nil
    while(low <= high) do
      mid = (low + high)/2
      comp = substring(mid)
      if comp > str
        high = mid - 1
      elsif comp < str
        low = mid + 1
      else
        found = comp
        low = high + 1 # force the loop to end
      end
    end
    found
  end

end
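
Note that bsearch above compares whole suffixes, so on its own it only matches a query that is itself a suffix of the string. Below is a possible variation that keeps the index-only storage, compares prefixes instead, and returns every position where the query occurs. This is a sketch rather than the gist's code: the names IndexSuffixArray, find_all and first_rank are invented for this example.

```ruby
# Sketch: index-only suffix array that reports every occurrence of a query.
class IndexSuffixArray
  def initialize(string)
    @string = string
    # Store only start positions, ordered by the suffix that begins at each one.
    @suf = (0...string.size).sort_by { |i| string[i..-1] }
  end

  # Returns the start positions of all occurrences of str, in ascending order.
  def find_all(str)
    # Suffixes starting with str form one contiguous run of ranks.
    first = first_rank(str.length) { |prefix| prefix >= str }
    last  = first_rank(str.length) { |prefix| prefix > str }
    @suf[first...last].sort
  end

  private

  # Smallest rank whose suffix, truncated to len characters, satisfies the
  # block; returns @suf.length when no rank does. Here high is an exclusive
  # bound and the loop runs while low < high, so mid is always a valid index.
  def first_rank(len)
    low = 0
    high = @suf.length
    while low < high
      mid = (low + high) / 2
      if yield(@string[@suf[mid], len])
        high = mid
      else
        low = mid + 1
      end
    end
    low
  end
end

sa = IndexSuffixArray.new("there is a man who likes dogs")
p sa.find_all("man") # [11]
p sa.find_all("o")   # [17, 26]
p sa.find_all("cat") # []
```

Building the array this way still sorts full suffix strings, so it is meant for short inputs like the one in the question.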