快速字符串搜索找到與givven模式匹配的字符串數組元素

我有一個常數字符串數組，我遍歷它來查找包含搜索模式的字符串的元素索引。我應該選擇哪種搜索算法來提高查找此元素的速度？如果有必要，我在運行準備查詢表的應用程序之前沒有時間限制。快速字符串搜索找到與givven模式匹配的字符串數組元素

我糾正一個問題 - 我沒有做精確的字符串匹配 - 我的元素，它是在一個陣列內搜索模式

array of strings: 
[0] Red fox jumps over the fence 
[1] Blue table 
[2] Red flowers on the fence

我需要找到其中包含單詞「表」的元素 - 在這種情況下，它的元素1

我喜歡50000次迭代的一組30個數組，它可以包含多達30000個不少於128個字符的字符串。現在我正在使用老式的strstr蠻力，這太慢了...

好吧，發佈我的函數的一部分，第一個strstr - 查找一個未切割的行數組，如果有任何事件發生，那麼蠻力搜索如下。我知道我可以加快這部分，但我不是這種方法做優化...

// sheets[i].buffer - contains a page of a text which is split into lines 
// fullfunccall - is the pattern 
// sheets[i].cells[k] - are the pointers to lines in a buffer 

for(i=0; i<sheetcount; i++) { 
    if(i!= currsheet && sheets[i].name && sheets[i].name[0] != '\0') { 
    if(strstr(sheets[i].buffer, fullfunccall)) { 
     usedexternally = 1; 

     int foundatleastone = 0; 
     for(k=0; k<sheets[i].numcells; k++) { 
     strncpy_s(testline, MAX_LINE_SIZE, sheets[i].cells[k]->line, sheets[i].cells[k]->linesize); 
     testline[sheets[i].cells[k]->linesize] = '\0'; 

     if(strstr(testline, fullfunccall)) { 
      dependency_num++; 

      if(dependency_num >= MAX_CELL_DEPENDENCIES-1) { 
      printf("allocation for sheet cell dependencies is insuficcient\n"); 
      return; 
      } 

      sheets[currsheet].cells[currcellposinsheet]->numdeps = dependency_num+1; 
      foundatleastone++; 
      sheets[currsheet].cells[currcellposinsheet]->deps[dependency_num] = &sheets[i].cells[k]; 
     } 
     } 

     if(foundatleastone == 0) { 
     printf("error locating dependency for external func: %s\n", fullfunccall); 
     return; 
     } 
    } 
    }; 
}

來源

2014-10-06 Ulterior

@DoomProg 30000個元素，包含一些200個字符的字符串 – Ulterior 2014-10-06 12:51:00

這是不完全清楚你在問什麼。你能舉個例子說明你在找什麼嗎？ – 2014-10-06 12:55:04

@MOehm我編輯了包含示例的問題 – Ulterior 2014-10-06 12:57:01

我認爲最實用的是DFA，因爲它讀取輸入的每個字符最多一次 - 更確切地說它讀取每輸入一次字符，只要模式不匹配，就立即停止（如果設置正確）。藉助DFA，您還可以同時檢查多種模式。

兩個很好（但不同）在實踐中很好的測試的DFA算法的實現是

PIRE
Ragel

這是不可能說，除非你提供更多的對哪些適合你的任務。

編輯：DFA停留「確定性有限自動機」

編輯：正如你指出你的模式是精確子最常見的解決方案是KMP算法（克努特莫里斯普拉特）

來源

2014-10-06 13:00:10

我喜歡一組30個數組的50000次迭代，最多可包含30000個不少於128個字符的字符串 – Ulterior 2014-10-06 13:02:51

如果50000次迭代意味着50000個靜態搜索模式，那麼DFA絕對是贏家將許多模式合併成單個迭代。 PIRE專爲此而設計，而Ragel可以在您的一些幫助下處理此問題（例如，以其輸入語言生成源代碼） – 2014-10-06 13:11:59

感謝Peter，但我期望有一些3或4個功能用於此作業，而不是圖書館，但我將擁有一看 – Ulterior 2014-10-06 13:27:33

你寫你的'haystack'（要搜索的字符串集合）大約是30000個字符串，每個200個字符。你還寫了'針'（搜索的術語）是一個5或20個字符的字符串。

基於此，您可以預先計算一個散列表，它將任何5個字符的子序列映射到乾草堆中的字符串中。對於30000個字符串（每個200個字符），最多有30000 *（200 - 5）= 5.850.000個不同的5字符子串。如果你將它們中的每一個散列到16位校驗和上，你需要至少11MB的內存（用於散列鍵）以及一些指向子串出現的字符串的指針。

例如，給定的

static const char *haystack[] = { "12345", "123456", "23456", "345678" };

一個簡化的草垛你預先計算哈希地圖，地圖的任何可能的5個字符的字符串，從而

12345 => haystack[0], haystack[1] 
23456 => haystack[1], haystack[2] 
34567 => haystack[3] 
45678 => haystack[4]

有了這個，你可以採取的第一個五年給定密鑰的字符（長度爲5或20個字符），對其進行散列，然後通過哈希映射密鑰的所有字符串執行正常的strstr。

來源

2014-10-06 13:46:19

我肯定會嘗試 – Ulterior 2014-10-06 13:57:59

對於您正在治療的每張紙，您可以按照this article中所述構建後綴數組。在開始搜索之前，請閱讀工作表，找到行開頭（作爲表格緩衝區中的整數索引），創建後綴數組並按照文章中的描述進行排序。

現在，如果您正在查找某個模式（例如「table」）出現的行，您可以搜索「table」之後的下一個條目和「tablf」之後的下一個條目，這是第一個非 - 匹配，你已經移動了最右邊的字母，里程錶式。

如果兩個索引相同，則沒有匹配。如果它們是不同的，你會得到一個指針列表進入表：

"tab. And now ..." 
---------------------------------------------------------------- 
"table and ..."    0x0100ab30 
"table water for ..."   0x0100132b 
"tablet computer ..."   0x01000208 
---------------------------------------------------------------- 
"tabloid reporter ..."

這會給你置身其中，減去片緩衝區的基指針的指針的列表，你可以得到的整數偏移。與行開頭的比較會給你對應這些指針的行號。（行號是排序的，所以你可以在這裏進行二分搜索。）

內存開銷是一個指針數組，其大小與表緩衝區大小相同，因此對於200個字符的30,000個字符串，在64位機器上爲48MB。（行索引的開銷可以忽略不計。）

對數組排序需要很長時間，但對於每個工作表只會執行一次。

編輯：這個想法似乎運作良好。我已經實現了它，並可以掃描約130,000字上text file of nearly 600k一本字典在不到一秒鐘：

#include <stdlib.h> 
#include <stdio.h> 
#include <string.h> 

#define die(...) exit((fprintf(stderr, "Fatal: " __VA_ARGS__), \ 
    putc(10, stderr), 1)) 



typedef struct Sheet Sheet;  

struct Sheet { 
    size_t size; /* Number of chars */ 
    char *buf;  /* Null-terminated char buffer */ 
    char **ptr;  /* Pointers into char buffer */ 
    size_t nline; /* number of lines */ 
    int *line;  /* array of offset of line beginnings */ 
    size_t naux; /* size of scratch array */ 
    char **aux;  /* scratch array */ 
}; 


/* 
*  Count occurrence of c in zero-terminated string p. 
*/ 
size_t strcount(const char *p, int c) 
{ 
    size_t n = 0; 

    for (;;) { 
     p = strchr(p, c); 
     if (p == NULL) return n; 
     p++; 
     n++;   
    } 

    return 0; 
} 

/* 
*  String comparison via pointers to strings. 
*/ 
int pstrcmp(const void *a, const void *b) 
{ 
    const char *const *aa = a; 
    const char *const *bb = b; 

    return strcmp(*aa, *bb); 
} 

/* 
*  Pointer comparison. 
*/ 
int ptrcmp(const void *a, const void *b) 
{ 
    const char *const *aa = a; 
    const char *const *bb = b; 

    if (*aa == *bb) return 0; 
    return (*aa < *bb) ? -1 : 1; 
} 

/* 
*  Create and prepare a sheet, i.e. a text file to search. 
*/ 
Sheet *sheet_new(const char *fn) 
{ 
    Sheet *sheet; 
    FILE *f = fopen(fn, "r"); 
    size_t n; 
    int last; 
    char *p; 
    char **pp; 

    if (f == NULL) die("Couldn't open %s", fn); 

    sheet = malloc(sizeof(*sheet)); 
    if (sheet == NULL) die("Allocation failed"); 

    fseek(f, 0, SEEK_END); 
    sheet->size = ftell(f); 
    fseek(f, 0, SEEK_SET); 

    sheet->buf = malloc(sheet->size + 1); 
    sheet->ptr = malloc(sheet->size * sizeof(*sheet->ptr)); 

    if (sheet->buf == NULL) die("Allocation failed"); 
    if (sheet->ptr == NULL) die("Allocation failed"); 

    fread(sheet->buf, 1, sheet->size, f); 
    sheet->buf[sheet->size] = '\0'; 
    fclose(f); 

    sheet->nline = strcount(sheet->buf, '\n'); 
    sheet->line = malloc(sheet->nline * sizeof(*sheet->line)); 

    sheet->aux = NULL; 
    sheet->naux = 0; 

    n = 0; 
    last = 0; 
    p = sheet->buf; 
    pp = sheet->ptr; 
    while (*p) { 
     *pp++ = p; 
     if (*p == '\n') { 
      sheet->line[n++] = last; 
      last = p - sheet->buf + 1; 
     } 
     p++; 
    } 

    qsort(sheet->ptr, sheet->size, sizeof(*sheet->ptr), pstrcmp); 

    return sheet; 
} 

/* 
*  Clean up sheet. 
*/ 
void sheet_delete(Sheet *sheet) 
{ 
    free(sheet->buf); 
    free(sheet->ptr); 
    free(sheet->line); 
    free(sheet->aux); 
    free(sheet); 
} 

/* 
*  Binary range search for string pointers. 
*/ 
static char **pstr_bsearch(const char *key, 
    char **arr, size_t high) 
{ 
    size_t low = 0; 

    while (low < high) { 
     size_t mid = (low + high)/2; 
     int diff = strcmp(key, arr[mid]); 

     if (diff < 0) high = mid; 
     else low = mid + 1; 
    } 

    return arr + low; 
} 

/* 
*  Binary range search for line offsets. 
*/ 
static const int *int_bsearch(int key, const int *arr, size_t high) 
{ 
    size_t low = 0; 

    while (low < high) { 
     size_t mid = (low + high)/2; 
     int diff = key - arr[mid]; 

     if (diff < 0) high = mid; 
     else low = mid + 1; 
    } 

    if (low < 1) return NULL; 
    return arr + low - 1; 
} 

/* 
*  Find occurrences of the string key in the sheet. Returns the 
*  number of lines in which the key occurs and assigns up to 
*  max lines to the line array. (If max is 0, line may be NULL.) 
*/ 
int sheet_find(Sheet *sheet, char *key, 
    int line[], int max) 
{ 
    char **begin, **end; 
    int n = 0; 
    size_t i, m; 
    size_t last; 

    begin = pstr_bsearch(key, sheet->ptr, sheet->size); 
    if (begin == NULL) return 0; 

    key[strlen(key) - 1]++; 
    end = pstr_bsearch(key, sheet->ptr, sheet->size); 
    key[strlen(key) - 1]--; 
    if (end == NULL) return 0; 
    if (end == begin) return 0; 

    m = end - begin; 
    if (m > sheet->naux) { 
     if (sheet->naux == 0) sheet->naux = 0x100; 
     while (sheet->naux < m) sheet->naux *= 2; 
     sheet->aux = realloc(sheet->aux, sheet->naux * sizeof(*sheet->aux)); 
     if (sheet->aux == NULL) die("Re-allocation failed");   
    } 

    memcpy(sheet->aux, begin, m * sizeof(*begin)); 
    qsort(sheet->aux, m, sizeof(*begin), ptrcmp); 

    last = 0; 
    for (i = 0; i < m; i++) { 
     int offset = sheet->aux[i] - sheet->buf; 
     const int *p; 

     p = int_bsearch(offset, sheet->line + last, sheet->nline - last); 

     if (p) { 
      if (n < max) line[n] = p - sheet->line; 
      last = p - sheet->line + 1; 
      n++; 
     } 
    } 

    return n; 
} 

/* 
*  Example client code 
*/ 
int main(int argc, char **argv) 
{ 
    Sheet *sheet; 
    FILE *f; 

    if (argc != 3) die("Usage: %s patterns corpus", *argv); 

    sheet = sheet_new(argv[2]); 

    f = fopen(argv[1], "r"); 
    if (f == NULL) die("Can't open %s.", argv[1]); 
    for (;;) { 
     char str[80]; 
     int line[50]; 
     int i, n; 

     if (fgets(str, sizeof(str), f) == NULL) break; 
     strtok(str, "\n"); 
     n = sheet_find(sheet, str, line, 50); 
     printf("%8d %s\n", n, str); 

     if (n > 50) n = 50; 
     for (i = 0; i < n; i++) printf(" [%d] %d\n", i, line[i] + 1); 
    } 
    fclose(f); 

    sheet_delete(sheet); 

    return 0; 
}

實施有粗糙的邊緣，但它的作品。我並不特別喜歡scratch數組和附加的排序，但是事實證明，即使排序大的後綴數組也不需要太長時間。

如果您願意，您可以將此解決方案擴展到更多工作表。

來源

2014-10-06 14:23:10

快速字符串搜索找到與givven模式匹配的字符串數組元素

回答

相關問題