查找所有「字符相等」字符串的高效算法？

我們如何編寫一個輸出字符串「homoglyph equivalents」的高效函數？查找所有「字符相等」字符串的高效算法？

實施例1（僞代碼）：

homoglyphs_list = [ 
        ["o", "0"], // "o" and "0" are homoglyphs 
        ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs 
        ] 

input_string = "someinput" 
output   = [ 
        "someinput", "s0meinput", "somelnput", 
        "s0melnput", "some1nput", "s0me1nput" 
        ]

實施例2：

homoglyphs_list = [ 
        ["rn", "m", "nn"], 
        ] 

input_string = "rnn" 
output   = ["rnn", "rm", "mn", "rrn", "nnn", "nm", "nrn"]

實施例3：

homoglyphs_list = [ 
        ["d", "ci", "a"], // "o" and "0" are homoglyphs 
        ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs 
        ] 
        /* 
        notice that with the two rules above, 
        we can infer "d" = "ci" = "a" = "cl" = "c1" 
        */ 

input_string = "pacerier" 
output   = [ 
        "pacerier", "pacerler", "pacer1er", "pcicerier", 
        "pcicerler", "pcicer1er", "pclcerier", "pc1cerier", 
        "pclcerler", "pc1cerler", "pclcer1er", "pc1cer1er", 
        "pdcerier", "pdcerler", "pdcer1er" 
        ]

注意：輸出數組中成員的順序並不重要，我們可以假設給定的同類映射映射被假定爲正確的（輸入不會給我們一個「無限循環」）。

我目前的算法有效，但它使用原始bruteforcing和性能是可怕的。例如。的"mmmmm"有同形字["rn", "m", "nn"]輸入需要38秒運行：

// This is php code (non-pseudo just so we could test the running time), 
// but the question remains language-agnostic 

public function Func($in, Array $mappings){ 
    $out_table = array(); 
    $out_table[$in] = null; 
    while(true){ 
     $number_of_entries_so_far = count($out_table); 
     foreach(array_keys($out_table) as $key){ 
      foreach($mappings as $mapping){ 
       foreach($mapping as $value){ 
        for($a=0, $aa=strlen($key); $a<$aa; ++$a){ 
         $pos = strpos($key, $value, $a); 
         if($pos === false){ 
          continue; 
         } 
         foreach($mapping as $equal_value){ 
          if($value === $equal_value){ 
           continue; 
          } 
          $out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null; 
         } 
        } 

       } 
      } 
     } 
     if($number_of_entries_so_far === count($out_table)){ 
      // it means we have tried bruteforcing but added no additional entries, 
      // so we can quit now 
      break; 
     } 
    } 
    return array_keys($out_table); 
}

我們怎樣才能實現高效的（快）同形字擴展算法？

來源

2013-08-05 Pacerier

它會做什麼，如果我像'$ homoglyph_mappings [0] = array（「n」，「nn」，「nnn」）;'' – Rajan

@Rajan，如上所述，我們可以假定輸入是正確的（即，根據不正確的輸入，我們得到未定義的行爲）。你的例子是一個不正確的輸入，因爲它會給我們一個無限循環。如果'n = nn'，然後'nn = nnnn'，然後'nnnn = nnnnnnnn'，然後'nnnnnnnn = nnnnnnnnnnnnnnn'等等，*無限* ... – Pacerier

好吧，這是一個_exceptional case_。 – Rajan

您應該從遞歸實現中獲得一些性能，但我沒有在實際性能方面做太多的考慮。這會避免多次循環輸出字符串，並在每次迭代中計數輸出。此外，我添加了一個「緩存」，以防止兩次計算相同的同類映像，爲簡單起見，它不檢查映射，但可以實現它，或者在映射更改的每次調用之前清除緩存（但這是醜陋和容易引入錯誤）。

代碼如下所示：

cache = {} 
def homoglyph(input_string, mappings): 
    output = [] 
    character = input_string[0] 
    if input_string in cache: 
     return cache[input_string] 
    elif not input_string: 
     return [] 
    for charset in mappings: 
     if character in charset: 
      temp = input_string 
      sub_homoglyphs = homoglyph(input_string[1:], mappings) 
      for char in charset: 
       temp[0] = char 
       output.append(temp) 
       #adding the homoglyph character to the rest of the string 
       for subhomoglyph in sub_homoglyphs: 
        output.append(char+subhomoglyph) 
    cache[input_string] = output 
    return output

（這是使用Python編寫的，因爲我沒有得到很好的PHP語法精通，我還沒有真正運行這個所以可能會有錯誤，但邏輯的要點在那裏）

來源

2013-08-20 20:13:24 Don

查找所有「字符相等」字符串的高效算法？

回答

相關問題