2017-05-22 77 views
17

我試圖發現模式:最有效的方法來搜索字符串中的未知模式?

  • 發生不止一次
  • 超過100個字符長
  • 沒有任何其他已知的圖案

的子不知道任何的可能發生的模式。

例如:

  • 字符串 「男孩倒在鍾」 將返回'ell', 'the b', 'y '
  • 字符串「男孩倒下了鍾,男孩倒下了鍾」將返回'the boy fell by the bell'

採用雙層for循環,可以蠻力強行非常低效:

ArrayList<String> patternsList = new ArrayList<>(); 
int length = string.length(); 
for (int i = 0; i < length; i++) { 
    int limit = (length - i)/2; 
    for (int j = limit; j >= 1; j--) { 
     int candidateEndIndex = i + j; 
     String candidate = string.substring(i, candidateEndIndex); 

     if(candidate.length() <= 1) { 
      continue; 
     } 

     if (string.substring(candidateEndIndex).contains(candidate)) { 
      boolean notASubpattern = true; 
      for (String pattern : patternsList) { 
       if (pattern.contains(candidate)) { 
        notASubpattern = false; 
        break; 
       } 
      } 

      if (notASubpattern) { 
       patternsList.add(candidate); 
      } 
     } 
    } 
} 

然而,隨着噸的搜索模式的大字符串時,這是令人難以置信的慢。

+5

從某種意義上說,這是一種壓縮形式。您可能會對各種壓縮算法進行一些研究。 –

+0

爲什麼單個空間不是您的第一個結果示例中的元素? –

+1

@Björn因爲它只有一個字符。 –

回答

8

您可以使用n-gram來查找字符串中的模式。需要O(n)時間來掃描字符串的n元組。當你使用一個n-gram找到一個子字符串時,把它放入一個散列表中,並計算該字符串中找到該子字符串的次數。當您在字符串中搜索完n-grams後,在哈希表中搜索大於1的計數,以查找字符串中的重複出現的模式。

例如,使用6克的字符串「男孩摔倒了,男孩倒下了鍾」,會發現子字符串「男孩摔倒了」。具有該子字符串的哈希表條目將具有2的計數,因爲它在該字符串中出現兩次。改變n-gram中的單詞數量將幫助您發現字符串中的不同模式。

Dictionary<string, int>dict = new Dictionary<string, int>(); 
int count = 0; 
int ngramcount = 6; 
string substring = ""; 

// Add entries to the hash table 
while (count < str.length) { 
    // copy the words into the substring 
    int i = 0; 
    substring = ""; 
    while (ngramcount > 0 && count < str.length) { 
     substring[i] = str[count]; 
     if (str[i] == ' ') 
      ngramcount--; 
     i++; 
     count++; 
    } 
    ngramcount = 6; 
    substring.Trim(); // get rid of the last blank in the substring 
    // Update the dictionary (hash table) with the substring 
    if (dict.Contains(substring)) { // substring is already in hash table so increment the count 
     int hashCount = dict[substring]; 
     hashCount++; 
     dict[substring] = hashCount; 
    } 
    else 
     dict[substring] = 1; 
} 

// Find the most commonly occurrring pattern in the string 
// by searching the hash table for the greatest count. 
int maxCount = 0; 
string mostCommonPattern = ""; 
foreach (KeyValuePair<string, int> pair in dict) { 
    if (pair.Value > maxCount) { 
     maxCount = pair.Value; 
     mostCommonPattern = pair.Key; 
    } 
} 
1

我寫這個只是爲了好玩。我希望我已經正確地理解了這個問題,這是有效和快速的;如果沒有,請在我身上輕鬆:)我可能會優化一點我猜,如果有人認爲它有用。

private static IEnumerable<string> getPatterns(string txt) 
{ 
    char[] arr = txt.ToArray(); 
    BitArray ba = new BitArray(arr.Length); 
    for (int shingle = getMaxShingleSize(arr); shingle >= 2; shingle--) 
    { 
     char[] arr1 = new char[shingle]; 
     int[] indexes = new int[shingle]; 
     HashSet<int> hs = new HashSet<int>(); 
     Dictionary<int, int[]> dic = new Dictionary<int, int[]>(); 
     for (int i = 0, count = arr.Length - shingle; i <= count; i++) 
     { 
      for (int j = 0; j < shingle; j++) 
      { 
       int index = i + j; 
       arr1[j] = arr[index]; 
       indexes[j] = index; 
      } 
      int h = getHashCode(arr1); 
      if (hs.Add(h)) 
      { 
       int[] indexes1 = new int[indexes.Length]; 
       Buffer.BlockCopy(indexes, 0, indexes1, 0, indexes.Length * sizeof(int)); 
       dic.Add(h, indexes1); 
      } 
      else 
      { 
       bool exists = false; 
       foreach (int index in indexes) 
        if (ba.Get(index)) 
        { 
         exists = true; 
         break; 
        } 
       if (!exists) 
       { 
        int[] indexes1 = dic[h]; 
        if (indexes1 != null) 
         foreach (int index in indexes1) 
          if (ba.Get(index)) 
          { 
           exists = true; 
           break; 
          } 
       } 
       if (!exists) 
       { 
        foreach (int index in indexes) 
         ba.Set(index, true); 
        int[] indexes1 = dic[h]; 
        if (indexes1 != null) 
         foreach (int index in indexes1) 
          ba.Set(index, true); 
        dic[h] = null; 
        yield return new string(arr1); 
       } 
      } 
     } 
    } 
} 
private static int getMaxShingleSize(char[] arr) 
{    
    for (int shingle = 2; shingle <= arr.Length/2 + 1; shingle++) 
    { 
     char[] arr1 = new char[shingle]; 
     HashSet<int> hs = new HashSet<int>(); 
     bool noPattern = true; 
     for (int i = 0, count = arr.Length - shingle; i <= count; i++) 
     { 
      for (int j = 0; j < shingle; j++) 
       arr1[j] = arr[i + j]; 
      int h = getHashCode(arr1); 
      if (!hs.Add(h)) 
      { 
       noPattern = false; 
       break; 
      } 
     } 
     if (noPattern) 
      return shingle - 1; 
    } 
    return -1; 
} 
private static int getHashCode(char[] arr) 
{ 
    unchecked 
    { 
     int hash = (int)2166136261; 
     foreach (char c in arr) 
      hash = (hash * 16777619)^c.GetHashCode(); 
     return hash; 
    } 
} 

編輯
我以前的代碼有嚴重的問題。這一個是更好:

private static IEnumerable<string> getPatterns(string txt) 
{ 
    Dictionary<int, int> dicIndexSize = new Dictionary<int, int>(); 
    for (int shingle = 2, count0 = txt.Length/2 + 1; shingle <= count0; shingle++) 
    { 
     Dictionary<string, int> dic = new Dictionary<string, int>(); 
     bool patternExists = false; 
     for (int i = 0, count = txt.Length - shingle; i <= count; i++) 
     { 
      string sub = txt.Substring(i, shingle); 
      if (!dic.ContainsKey(sub)) 
       dic.Add(sub, i); 
      else 
      { 
       patternExists = true; 
       int index0 = dic[sub]; 
       if (index0 >= 0) 
       { 
        dicIndexSize[index0] = shingle; 
        dic[sub] = -1; 
       } 
      } 
     } 
     if (!patternExists) 
      break; 
    } 
    List<int> lst = dicIndexSize.Keys.ToList(); 
    lst.Sort((a, b) => dicIndexSize[b].CompareTo(dicIndexSize[a])); 
    BitArray ba = new BitArray(txt.Length); 
    foreach (int i in lst) 
    { 
     bool ok = true; 
     int len = dicIndexSize[i]; 
     for (int j = i, max = i + len; j < max; j++) 
     { 
      if (ok) ok = !ba.Get(j); 
      ba.Set(j, true); 
     } 
     if (ok) 
      yield return txt.Substring(i, len); 
    } 
} 

文本中this book在我的電腦了3.4sec。

+0

Hi @AlexQuilliam。我想知道你是否找到了一個好的解決方案。如果是這樣的話,如果你添加代碼會很好。我很好奇我的代碼相對於最佳解決方案的性能和有效性。 – Koray

0

後綴數組是正確的想法,但有一個不平凡的一件遺失,即確定什麼在文獻中被稱爲「supermaximal重複」。這裏有一個GitHub倉庫,其代碼爲:https://github.com/eisenstatdavid/commonsub。後綴數組結構使用SAIS庫,作爲子模塊進行銷售。使用Efficient repeat finding via suffix arrays (Becher–Deymonnaz–Heiber)中的findsmaxr的僞代碼的修正版本找到超最大重複。

static void FindRepeatedStrings(void) { 
    // findsmaxr from https://arxiv.org/pdf/1304.0528.pdf 
    printf("["); 
    bool needComma = false; 
    int up = -1; 
    for (int i = 1; i < Len; i++) { 
    if (LongCommPre[i - 1] < LongCommPre[i]) { 
     up = i; 
     continue; 
    } 
    if (LongCommPre[i - 1] == LongCommPre[i] || up < 0) continue; 
    for (int k = up - 1; k < i; k++) { 
     if (SufArr[k] == 0) continue; 
     unsigned char c = Buf[SufArr[k] - 1]; 
     if (Set[c] == i) goto skip; 
     Set[c] = i; 
    } 
    if (needComma) { 
     printf("\n,"); 
    } 
    printf("\""); 
    for (int j = 0; j < LongCommPre[up]; j++) { 
     unsigned char c = Buf[SufArr[up] + j]; 
     if (iscntrl(c)) { 
     printf("\\u%.4x", c); 
     } else if (c == '\"' || c == '\\') { 
     printf("\\%c", c); 
     } else { 
     printf("%c", c); 
     } 
    } 
    printf("\""); 
    needComma = true; 
    skip: 
    up = -1; 
    } 
    printf("\n]\n"); 
} 

這裏的第一個段落的文本輸出樣本:

Davids-MBP:commonsub eisen$ ./repsub input 
["\u000a" 
," S" 
," as " 
," co" 
," ide" 
," in " 
," li" 
," n" 
," p" 
," the " 
," us" 
," ve" 
," w" 
,"\"" 
,"&ndash;" 
,"(" 
,")" 
,". " 
,"0" 
,"He" 
,"Suffix array" 
,"`" 
,"a su" 
,"at " 
,"code" 
,"com" 
,"ct" 
,"do" 
,"e f" 
,"ec" 
,"ed " 
,"ei" 
,"ent" 
,"ere's a " 
,"find" 
,"her" 
,"https://" 
,"ib" 
,"ie" 
,"ing " 
,"ion " 
,"is" 
,"ith" 
,"iv" 
,"k" 
,"mon" 
,"na" 
,"no" 
,"nst" 
,"ons" 
,"or" 
,"pdf" 
,"ri" 
,"s are " 
,"se" 
,"sing" 
,"sub" 
,"supermaximal repeats" 
,"te" 
,"ti" 
,"tr" 
,"ub " 
,"uffix arrays" 
,"via" 
,"y, " 
] 
0

我會用Knuth–Morris–Pratt algorithm(線性時間複雜度爲O(n))找到子。我會嘗試找到最大的子字符串模式,將其從輸入字符串中移除並嘗試查找第二大字符串等。我會做這樣的事情:

string pattern = input.substring(0,lenght/2); 
string toMatchString = input.substring(pattern.length, input.lenght - 1); 

List<string> matches = new List<string>(); 

while(pattern.lenght > 0) 
{ 
    int index = KMP(pattern, toMatchString); 
    if(index > 0) 
    { 
     matches.Add(pattern); 

     // remove the matched pattern occurences from the input string 
     // I would do something like this: 
     // 0 to pattern.lenght gets removed 
     // check for all occurences of pattern in toMatchString and remove them 
     // get the remaing shrinked input, reassign values for pattern & toMatchString 
     // keep looking for the next largest substring 
    } 
    else 
    { 
     pattern = input.substring(0, pattern.lenght - 1); 
     toMatchString = input.substring(pattern.length, input.lenght - 1); 
    } 
} 

KMP實現克努特莫里斯普拉特算法。你可以在GithubPrinceton找到它的Java實現或自己寫。

PS:我沒有使用Java編寫代碼,而且它很快嘗試着我的第一個賞金即將關閉。所以,如果我錯過了一些微不足道的東西或者發生了+/- 1錯誤,請不要給我棍棒。

相關問題