通配符字符串搜索算法

在我的程序中，我需要在相當大的字符串（〜1 mb）中搜索相對較小的子字符串（< 1 kb）。問題是字符串包含「a？c」意義上的簡單通配符，這意味着我想要搜索字符串，如「abc」或「apc」，...（我只對第一次發生感興趣）。通配符字符串搜索算法

直到現在我用瑣碎的方法（在這裏僞代碼）

algorithm "search", input: haystack(string), needle(string) 

for(i = 0, i < length(haystack), ++i) 
if(!CompareMemory(haystack+i,needle,length(needle)) 
    return i; 

return -1; (Not found)

其中「CompareMemory」當且僅當第一個和第二個參數返回0是相同的（也涉及通配符）只針對字節的第三個量論點給出。

我的問題是現在如果有一個快速的算法（你不必給它，但如果你喜歡C++，c或僞代碼）。我開始here 但我認爲大多數快速算法不允許通配符（通過它們利用字符串的性質）。

我希望問題的格式可以，因爲我是新來的，提前謝謝！

來源

2011-08-15 Listing

如何使用正則表達式與您當前的解決方案進行比較？這裏提到了一些優秀的庫：http://stackoverflow.com/questions/181624/c-what-regex-library-should-i-use –

在c個字符的字母表中的字符串中，讓S具有長度s並讓T_1 ... T_k具有平均長度b。 S將被搜索每個k個目標字符串。（問題陳述沒有提到給定字符串的多次搜索;我在下面提到它，因爲在那個範例中，我的程序很好。）

該程序使用O（s + c）時間和空間進行設置，如果S和T_i是隨機串）O（k * u * s/c）+ O（k * b + k * b * s/c^u）用於搜索的總時間，在程序中u = 3。對於更長的目標，你應該增加，並選擇罕見，廣泛分離的關鍵字符。

在步驟1中，程序創建一個s + TsizMax整數的數組L（在程序中，TsizMax =允許的目標長度）並將其用於c列表中下一個出現的字符的位置，列表頭在H []中和T []中的尾巴。這是O（s + c）時間和空間的一步。

在步驟2中，程序反覆讀取並處理目標字符串。步驟2A選擇u = 3個不同的非通配關鍵字符（在當前目標中）。如圖所示，該程序僅使用前三個這樣的字符;只需稍微多一點工作，就可以使用目標中最稀有的字符來提高性能。請注意，它不適用於少於三個此類角色的目標。行「L [T [r]] = L [g + i] = g + i;」在步驟2A中，在L中設置一個保護單元，使其具有適當的增量偏移，使得步驟2G在搜索結束時自動執行，而不需要在搜索期間進行任何額外的測試。 T [r]將字符串r的列表尾單元索引，因此單元格L [g + i]成爲字符r的新的自引用列表結尾。（這種技術允許循環以最少的外部條件測試運行。）

步驟2B將變量a，b，c設置爲列表頭部位置，並設置與目標中所選關鍵字符之間的距離相對應的增量dab，dac和dbc。

步驟2C檢查是否在S中出現關鍵字符。這一步是必需的，否則步驟2E中的while循環將掛起。我們不希望在while循環中進行更多檢查，因爲它們是搜索的內部循環。

步驟2D執行步驟2E至2i，直到var c指向S結束之後，此時不可能再進行任何匹配。

步驟2E由u = 3 while循環組成，即「強制增量距離」，也就是說，只要索引a，b，c沿着彼此爬行，只要它們不是與模式兼容的。 while循環相當快，對於各種v，d，w，本質上（除去++ si工具）「while（v + d < w）v = L [v]」。幾次複製三個while循環可能會提高性能，並且不會改變淨結果。

在步驟2G中，我們知道u個關鍵字符匹配，所以我們對目標與匹配點進行完整比較，並使用通配符處理。步驟2H報告比較的結果。節目也給出了本節中的不匹配;刪除生產中的。

步驟2I前進所有關鍵字符索引，因爲沒有當前索引的字符可以成爲另一個匹配的關鍵部分。

您可以運行該程序來查看一些操作計數統計信息。例如，輸出

Target 5=<de?ga> 

abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg 

@ 17, de?ga and de3ga match 
@ 24, de?ga and defg4 differ 
@ 31, de?ga and defga match 
Advances: 'd' 0+3 'e' 3+3 'g' 3+3 = 6+9 = 15

表示步驟2G被輸入了3次（即關鍵字符匹配了3次）。全部比較成功兩次;步驟2E while循環提前索引6次;步驟2I高級索引9次;總共有15個進步，用於搜索42個字符的字符串以指定目標。

/* jiw 
$Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $ 

Re: Concept-code for searching a long string for short targets, 
    where targets may contain wildcard characters. 

The user can enter any number of targets as command line parameters. 
This code has 2 long strings available for testing; if the first 
character of the first parameter is '1' the jay[42] string is used, 
else kay[321]. 

Eg, for tests with *hay = jay use command like 

    ./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f 

or with *hay = kay, 

    ./stringsearch bc?e? jih? pa?j ?av??j 

to exercise program. 

Copyright 2011 James Waldby. Offered without warranty 
under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html 
*/ 
#include <stdlib.h> 
#include <stdio.h> 
#include <string.h> 
#include <limits.h> 
//================================================ 
int main(int argc, char *argv[]) { 
    char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg"; 
    char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja" 
    "etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii" 
    "etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii" 
    "ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach" 
    "ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch"; 
    char *hay = (argc>1 && argv[1][0]=='1')? jay:kay; 

    enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 }; 
    int L[L2], H[chars], T[chars], g, k, par; 

    // Step 1. Make arrays L, H, T. 
    for (k=0; k<chars; ++k) H[k] = T[k] = L1; // Init H and T 
    for (g=0; hay[g]; ++g) { // Make linked character lists for hay. 
    k = hay[g];   // In same loop, could count char freqs. 
    if (T[k]==L1) H[k] = T[k] = g; 
    T[k] = L[T[k]] = g; 
    } 

    // Step 2. Read and process target strings. 
    for (par=1; par<argc; ++par) { 
    int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r; 
    char * targ = argv[par]; 
    enum { wild = '?' }; 
    int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0; 
    printf ("Target %d=<%s>\n", par, targ); 

    // Step 2A. Choose 3 non-wild characters to follow. 
    // As is, chooses first 3 non-wilds for a,b,c. 
    // Could instead choose 3 rarest characters. 
    for (j=0; j<3; ++j) alpha[j] = -j; 
    for (i=j=0; targ[i] && j<3; ++i) 
     if (targ[i] != wild) { 
     r = alpha[j] = targ[i]; 
     if (alpha[0]==alpha[1] || alpha[1]==alpha[2] 
      || alpha[0]==alpha[2]) continue; 
     at[j] = i; 
     L[T[r]] = L[g+i] = g+i; 
     ++j; 
     } 
    if (j != 3) { 
     printf (" Too few target chars\n"); 
     continue; 
    } 

    // Step 2B. Set a,b,c to head-of-list locations, set deltas. 
    da = at[0]; 
    a = H[alpha[0]]; dab = at[1]-at[0]; 
    b = H[alpha[1]]; dbc = at[2]-at[1]; 
    c = H[alpha[2]]; dac = at[2]-at[0]; 

    // Step 2C. See if key characters appear in haystack 
    if (a >= g || b >= g || c >= g) { 
     printf (" No match on some character\n"); 
     continue;  
    } 

    for (g=0; hay[g]; ++g) printf ("%d", g%10); 
    printf ("\n%s\n", hay); // Show haystack, for user aid 

    // Step 2D. Search for match 
    while (c < g) { 
     // Step 2E. Enforce delta distances 
     while (a+dab < b) {a = L[a]; ++sa; } // Replicate these 
     while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many 
     while (a+dac > c) {c = L[c]; ++sc; } // times as you like. 
     while (a+dab < b) {a = L[a]; ++sa; } // Replicate these 
     while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many 
     while (a+dac > c) {c = L[c]; ++sc; } // times as you like. 

     // Step 2F. See if delta distances were met 
     if (a+dab==b && b+dbc==c && c<g) { 
     // Step 2G. Yes, so we have 3-letter-match and need to test whole match. 
     r = a-da; 
     for (k=0; targ[k]; ++k) 
      if ((hay[r+k] != targ[k]) && (targ[k] != wild)) 
      break; 
     printf ("@ %3d, %s and ", r, targ); 
     for (i=0; targ[i]; ++i) putchar(hay[r++]); 
     // Step 2H. Report match, if found 
     puts (targ[k]? " differ" : " match"); 
     // Step 2I. Advance all of a,b,c, to go on looking 
     a = L[a]; ++ta; 
     b = L[b]; ++tb; 
     c = L[c]; ++tc; 
     } 
    } 
    printf ("Advances: '%c' %d+%d '%c' %d+%d '%c' %d+%d = %d+%d = %d\n", 
     alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc, 
     sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc); 
    } 
    return 0; 
}

注意，如果你喜歡這個答案比當前的首選答案更好，取消標記之一，標誌着這一個。 :)

來源

2011-08-20 04:35:35

不錯，我沒有想到這樣詳細的答案:-) – Listing

正則表達式通常使用finite state automation-based search，我想。嘗試實施。

來源

2011-08-15 15:51:34 Ryan

是的，一些非常快的FSM代碼可以由Ragel FSM編譯器生成， http://www.complang.org/ragel/ – john

一種快速的方式，與使用正則表達式類似（我會推薦），是找到一些固定在針，「a」中，但不是「？」，並搜索爲它，然後看看你是否有完整的比賽。

j = firstNonWildcardPos(needle) 
for(i = j, i < length(haystack)-length(needle)+j, ++i) 
    if(haystack[i] == needle[j]) 
    if(!CompareMemory(haystack+i-j,needle,length(needle)) 
     return i; 

return -1; (Not found)

正則表達式會產生類似這樣的代碼（我相信）。

來源

2011-08-15 16:29:50 nulvinge

這樣做更快，因爲它不必經常調用CompareMemmory（因爲needle [j]不經常出現）。然後你可以使用第二個非擴展卡來使其更快更快 – nulvinge

謝謝。愚蠢的是，我沒有自己想出這個。我會留下這個問題，因爲當我的數據集中非常頻繁地出現第一個nonwildcard部分時，這幾乎等同於天真的方法。 – Listing

通配符字符串搜索算法

回答

相關問題