在c個字符的字母表中的字符串中,讓S具有長度s並讓T_1 ... T_k具有平均長度b。 S將被搜索每個k個目標字符串。 (問題陳述沒有提到給定字符串的多次搜索;我在下面提到它,因爲在那個範例中,我的程序很好。)
該程序使用O(s + c)時間和空間進行設置,如果S和T_i是隨機串)O(k * u * s/c)+ O(k * b + k * b * s/c^u)用於搜索的總時間,在程序中u = 3。對於更長的目標,你應該增加,並選擇罕見,廣泛分離的關鍵字符。
在步驟1中,程序創建一個s + TsizMax整數的數組L(在程序中,TsizMax =允許的目標長度)並將其用於c列表中下一個出現的字符的位置,列表頭在H []中和T []中的尾巴。這是O(s + c)時間和空間的一步。
在步驟2中,程序反覆讀取並處理目標字符串。步驟2A選擇u = 3個不同的非通配關鍵字符(在當前目標中)。如圖所示,該程序僅使用前三個這樣的字符;只需稍微多一點工作,就可以使用目標中最稀有的字符來提高性能。請注意,它不適用於少於三個此類角色的目標。行「L [T [r]] = L [g + i] = g + i;」在步驟2A中,在L中設置一個保護單元,使其具有適當的增量偏移,使得步驟2G在搜索結束時自動執行,而不需要在搜索期間進行任何額外的測試。 T [r]將字符串r的列表尾單元索引,因此單元格L [g + i]成爲字符r的新的自引用列表結尾。 (這種技術允許循環以最少的外部條件測試運行。)
步驟2B將變量a,b,c設置爲列表頭部位置,並設置與目標中所選關鍵字符之間的距離相對應的增量dab,dac和dbc。
步驟2C檢查是否在S中出現關鍵字符。這一步是必需的,否則步驟2E中的while循環將掛起。我們不希望在while循環中進行更多檢查,因爲它們是搜索的內部循環。
步驟2D執行步驟2E至2i,直到var c指向S結束之後,此時不可能再進行任何匹配。
步驟2E由u = 3 while循環組成,即「強制增量距離」,也就是說,只要索引a,b,c沿着彼此爬行,只要它們不是與模式兼容的。 while循環相當快,對於各種v,d,w,本質上(除去++ si工具)「while(v + d < w)v = L [v]」。幾次複製三個while循環可能會提高性能,並且不會改變淨結果。
在步驟2G中,我們知道u個關鍵字符匹配,所以我們對目標與匹配點進行完整比較,並使用通配符處理。步驟2H報告比較的結果。節目也給出了本節中的不匹配;刪除生產中的。
步驟2I前進所有關鍵字符索引,因爲沒有當前索引的字符可以成爲另一個匹配的關鍵部分。
您可以運行該程序來查看一些操作計數統計信息。例如,輸出
Target 5=<de?ga>
abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg
@ 17, de?ga and de3ga match
@ 24, de?ga and defg4 differ
@ 31, de?ga and defga match
Advances: 'd' 0+3 'e' 3+3 'g' 3+3 = 6+9 = 15
表示步驟2G被輸入了3次(即關鍵字符匹配了3次)。全部比較成功兩次;步驟2E while循環提前索引6次;步驟2I高級索引9次;總共有15個進步,用於搜索42個字符的字符串以指定目標。
/* jiw
$Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $
Re: Concept-code for searching a long string for short targets,
where targets may contain wildcard characters.
The user can enter any number of targets as command line parameters.
This code has 2 long strings available for testing; if the first
character of the first parameter is '1' the jay[42] string is used,
else kay[321].
Eg, for tests with *hay = jay use command like
./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f
or with *hay = kay,
./stringsearch bc?e? jih? pa?j ?av??j
to exercise program.
Copyright 2011 James Waldby. Offered without warranty
under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
//================================================
int main(int argc, char *argv[]) {
char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";
char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja"
"etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii"
"etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii"
"ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach"
"ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";
char *hay = (argc>1 && argv[1][0]=='1')? jay:kay;
enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 };
int L[L2], H[chars], T[chars], g, k, par;
// Step 1. Make arrays L, H, T.
for (k=0; k<chars; ++k) H[k] = T[k] = L1; // Init H and T
for (g=0; hay[g]; ++g) { // Make linked character lists for hay.
k = hay[g]; // In same loop, could count char freqs.
if (T[k]==L1) H[k] = T[k] = g;
T[k] = L[T[k]] = g;
}
// Step 2. Read and process target strings.
for (par=1; par<argc; ++par) {
int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r;
char * targ = argv[par];
enum { wild = '?' };
int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0;
printf ("Target %d=<%s>\n", par, targ);
// Step 2A. Choose 3 non-wild characters to follow.
// As is, chooses first 3 non-wilds for a,b,c.
// Could instead choose 3 rarest characters.
for (j=0; j<3; ++j) alpha[j] = -j;
for (i=j=0; targ[i] && j<3; ++i)
if (targ[i] != wild) {
r = alpha[j] = targ[i];
if (alpha[0]==alpha[1] || alpha[1]==alpha[2]
|| alpha[0]==alpha[2]) continue;
at[j] = i;
L[T[r]] = L[g+i] = g+i;
++j;
}
if (j != 3) {
printf (" Too few target chars\n");
continue;
}
// Step 2B. Set a,b,c to head-of-list locations, set deltas.
da = at[0];
a = H[alpha[0]]; dab = at[1]-at[0];
b = H[alpha[1]]; dbc = at[2]-at[1];
c = H[alpha[2]]; dac = at[2]-at[0];
// Step 2C. See if key characters appear in haystack
if (a >= g || b >= g || c >= g) {
printf (" No match on some character\n");
continue;
}
for (g=0; hay[g]; ++g) printf ("%d", g%10);
printf ("\n%s\n", hay); // Show haystack, for user aid
// Step 2D. Search for match
while (c < g) {
// Step 2E. Enforce delta distances
while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
// Step 2F. See if delta distances were met
if (a+dab==b && b+dbc==c && c<g) {
// Step 2G. Yes, so we have 3-letter-match and need to test whole match.
r = a-da;
for (k=0; targ[k]; ++k)
if ((hay[r+k] != targ[k]) && (targ[k] != wild))
break;
printf ("@ %3d, %s and ", r, targ);
for (i=0; targ[i]; ++i) putchar(hay[r++]);
// Step 2H. Report match, if found
puts (targ[k]? " differ" : " match");
// Step 2I. Advance all of a,b,c, to go on looking
a = L[a]; ++ta;
b = L[b]; ++tb;
c = L[c]; ++tc;
}
}
printf ("Advances: '%c' %d+%d '%c' %d+%d '%c' %d+%d = %d+%d = %d\n",
alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc,
sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc);
}
return 0;
}
注意,如果你喜歡這個答案比當前的首選答案更好,取消標記之一,標誌着這一個。 :)
如何使用正則表達式與您當前的解決方案進行比較?這裏提到了一些優秀的庫:http://stackoverflow.com/questions/181624/c-what-regex-library-should-i-use –