Boyer-Moore的算法利用了這樣一個事實,即當在一個較長的字符串內搜索「HELLO WORLD」時,如果要找到匹配,在給定位置中找到的字母會限制該位置周圍的內容完全可以說是海軍戰鬥遊戲:如果你在邊界的四個單元格中發現大海,你就不需要測試剩餘的四個單元格,以防在那裏隱藏一個5格的載體;不可能有。
如果您在第十一位發現了例如'D',它可能是HELLO WORLD的最後一個字母;但是如果你發現'Q','Q'不在HELLO WORLD內部的任何地方,這意味着搜索到的字符串不能在前11個字符中的任何地方,並且你可以避免在那裏完全搜索。另一方面,'L'可能意味着HELLO WORLD在那裏,從11-3位置(HELLO WORLD的第三個字母是L),11-4或11-10開始。
搜索時,使用兩個增量數組記錄這些可能性。
所以,當你找到一個模式,你應該做的,
if (j < 0)
{
// Found a pattern from position i+1 to i+1+patlen
// Add vector or whatever is needed; check we don't overflow it.
if (index_size+1 >= index_counter)
{
index[index_counter] = 0;
return index_size;
}
index[index_counter++] = i+1;
// Reinitialize j to restart search
j = patlen-1;
// Reinitialize i to start at i+1+patlen
i += patlen +1; // (not completely sure of that +1)
// Do not free delta2
// free(delta2);
// Continue loop without altering i again
continue;
}
i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
index[index_counter] = 0;
return index_counter;
這將返回索引的零結尾的名單,只要你通過像一個size_t *indexes
的功能。
函數將返回0(未找到),index_size(過多匹配)或1和index_size-1之間的匹配數。
這允許例如添加額外的匹配,而不必重複已找到的(index_size-1)子字符串的整個搜索;你通過new_num增加num_indexes
,realloc
的indexes
陣列,然後傳遞給函數的新的數組,在偏移old_index_size-1
,new_num作爲新的大小,以及從開始草堆串中的所述索引old_index_size-1
加一個(不匹配的偏移量,正如我在之前的修訂版中寫的,加上針串的長度;請參閱評論)。
這種方法將報告還重疊的匹配,例如搜索ANA在香蕉會發現B * ANA * NA和禁令* ANA *。
UPDATE
我測試以上,它似乎工作。我加入這兩個包括讓GCC從嘴裏嘟囔着
#include <stdio.h>
#include <string.h>
然後我修改了if (j < 0)
修改維基百科代碼簡單地輸出它發現了
if (j < 0) {
printf("Found %s at offset %d: %s\n", pat, i+1, string+i+1);
//free(delta2);
// return (string + i+1);
i += patlen + 1;
j = patlen - 1;
continue;
}
,最後我用這個
測試
int main(void)
{
char *s = "This is a string in which I am going to look for a string I will string along";
char *p = "string";
boyer_moore(s, strlen(s), p, strlen(p));
return 0;
}
得到,如預期:
Found string at offset 10: string in which I am going to look for a string I will string along
Found string at offset 51: string I will string along
Found string at offset 65: string along
如果字符串包含兩個重疊的序列,兩者都發現:
char *s = "This is an andean andeandean andean trouble";
char *p = "andean";
Found andean at offset 11: andean andeandean andean trouble
Found andean at offset 18: andeandean andean trouble
Found andean at offset 22: andean andean trouble
Found andean at offset 29: andean trouble
爲了避免重疊的匹配,最快的方法是不存儲的重疊。它可以在函數中完成,但它意味着重新初始化第一個增量矢量並更新字符串指針;我們還需要將第二個i
索引存儲爲i2
,以使保存的索引不會變得非單調。這不值得。更好:
if (j < 0) {
// We have found a patlen match at i+1
// Is it an overlap?
if (index && (indexes[index] + patlen < i+1))
{
// Yes, it is. So we don't store it.
// We could store the last of several overlaps
// It's not exactly trivial, though:
// searching 'anana' in 'Bananananana'
// finds FOUR matches, and the fourth is NOT overlapped
// with the first. So in case of overlap, if we want to keep
// the LAST of the bunch, we must save info somewhere else,
// say last_conflicting_overlap, and check twice.
// Then again, the third match (which is the last to overlap
// with the first) would overlap with the fourth.
// So the "return as many non overlapping matches as possible"
// is actually accomplished by doing NOTHING in this branch of the IF.
}
else
{
// Not an overlap, so store it.
indexes[++index] = i+1;
if (index == max_indexes) // Too many matches already found?
break; // Stop searching and return found so far
}
// Adapt i and j to keep searching
i += patlen + 1;
j = patlen - 1;
continue;
}
感謝您發佈此信息,我會嘗試將此工作納入代碼並查看它是如何實現的。 – Chase
我添加了您編寫的代碼,它似乎仍然停留在單個匹配項中。我明天將不得不進一步檢查。我在想也許有一些需要用'i'變量的對齊方式?我仍然沒有完全得到算法,但我可以看到如何可能需要進行調整或關於表的一些事情。 – Chase
我已驗證算法並對維基百科代碼進行了簡單修改(現在將其添加到我的答案中) – LSerni