2013-08-05 79 views
5

我們如何編寫一個輸出字符串「homoglyph equivalents」的高效函數?查找所有「字符相等」字符串的高效算法?

實施例1(僞代碼):

homoglyphs_list = [ 
        ["o", "0"], // "o" and "0" are homoglyphs 
        ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs 
        ] 

input_string = "someinput" 
output   = [ 
        "someinput", "s0meinput", "somelnput", 
        "s0melnput", "some1nput", "s0me1nput" 
        ] 

實施例2

homoglyphs_list = [ 
        ["rn", "m", "nn"], 
        ] 

input_string = "rnn" 
output   = ["rnn", "rm", "mn", "rrn", "nnn", "nm", "nrn"] 

實施例3

homoglyphs_list = [ 
        ["d", "ci", "a"], // "o" and "0" are homoglyphs 
        ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs 
        ] 
        /* 
        notice that with the two rules above, 
        we can infer "d" = "ci" = "a" = "cl" = "c1" 
        */ 

input_string = "pacerier" 
output   = [ 
        "pacerier", "pacerler", "pacer1er", "pcicerier", 
        "pcicerler", "pcicer1er", "pclcerier", "pc1cerier", 
        "pclcerler", "pc1cerler", "pclcer1er", "pc1cer1er", 
        "pdcerier", "pdcerler", "pdcer1er" 
        ] 

注意:輸出數組中成員的順序並不重要,我們可以假設給定的同類映射映射被假定爲正確的(輸入不會給我們一個「無限循環」)。

我目前的算法有效,但它使用原始bruteforcing和性能是可怕的。例如。的"mmmmm"有同形字["rn", "m", "nn"]輸入需要38秒運行:

// This is php code (non-pseudo just so we could test the running time), 
// but the question remains language-agnostic 

public function Func($in, Array $mappings){ 
    $out_table = array(); 
    $out_table[$in] = null; 
    while(true){ 
     $number_of_entries_so_far = count($out_table); 
     foreach(array_keys($out_table) as $key){ 
      foreach($mappings as $mapping){ 
       foreach($mapping as $value){ 
        for($a=0, $aa=strlen($key); $a<$aa; ++$a){ 
         $pos = strpos($key, $value, $a); 
         if($pos === false){ 
          continue; 
         } 
         foreach($mapping as $equal_value){ 
          if($value === $equal_value){ 
           continue; 
          } 
          $out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null; 
         } 
        } 

       } 
      } 
     } 
     if($number_of_entries_so_far === count($out_table)){ 
      // it means we have tried bruteforcing but added no additional entries, 
      // so we can quit now 
      break; 
     } 
    } 
    return array_keys($out_table); 
} 

我們怎樣才能實現高效的(快)同形字擴展算法?

+0

它會做什麼,如果我像'$ homoglyph_mappings [0] = array(「n」,「nn」,「nnn」);'' – Rajan

+0

@Rajan,如上所述,我們可以假定輸入是正確的(即,根據不正確的輸入,我們得到未定義的行爲)。你的例子是一個不正確的輸入,因爲它會給我們一個無限循環。如果'n = nn',然後'nn = nnnn',然後'nnnn = nnnnnnnn',然後'nnnnnnnn = nnnnnnnnnnnnnnn'等等,*無限* ... – Pacerier

+0

好吧,這是一個_exceptional case_。 – Rajan

回答

1

您應該從遞歸實現中獲得一些性能,但我沒有在實際性能方面做太多的考慮。這會避免多次循環輸出字符串,並在每次迭代中計數輸出。此外,我添加了一個「緩存」,以防止兩次計算相同的同類映像,爲簡單起見,它不檢查映射,但可以實現它,或者在映射更改的每次調用之前清除緩存(但這是醜陋和容易引入錯誤)。

代碼如下所示:

cache = {} 
def homoglyph(input_string, mappings): 
    output = [] 
    character = input_string[0] 
    if input_string in cache: 
     return cache[input_string] 
    elif not input_string: 
     return [] 
    for charset in mappings: 
     if character in charset: 
      temp = input_string 
      sub_homoglyphs = homoglyph(input_string[1:], mappings) 
      for char in charset: 
       temp[0] = char 
       output.append(temp) 
       #adding the homoglyph character to the rest of the string 
       for subhomoglyph in sub_homoglyphs: 
        output.append(char+subhomoglyph) 
    cache[input_string] = output 
    return output 

(這是使用Python編寫的,因爲我沒有得到很好的PHP語法精通,我還沒有真正運行這個所以可能會有錯誤,但邏輯的要點在那裏)

相關問題