對於您正在治療的每張紙,您可以按照this article中所述構建後綴數組。在開始搜索之前,請閱讀工作表,找到行開頭(作爲表格緩衝區中的整數索引),創建後綴數組並按照文章中的描述進行排序。
現在,如果您正在查找某個模式(例如「table」)出現的行,您可以搜索「table」之後的下一個條目和「tablf」之後的下一個條目,這是第一個非 - 匹配,你已經移動了最右邊的字母,里程錶式。
如果兩個索引相同,則沒有匹配。如果它們是不同的,你會得到一個指針列表進入表:
"tab. And now ..."
----------------------------------------------------------------
"table and ..." 0x0100ab30
"table water for ..." 0x0100132b
"tablet computer ..." 0x01000208
----------------------------------------------------------------
"tabloid reporter ..."
這會給你置身其中,減去片緩衝區的基指針的指針的列表,你可以得到的整數偏移。與行開頭的比較會給你對應這些指針的行號。 (行號是排序的,所以你可以在這裏進行二分搜索。)
內存開銷是一個指針數組,其大小與表緩衝區大小相同,因此對於200個字符的30,000個字符串,在64位機器上爲48MB。 (行索引的開銷可以忽略不計。)
對數組排序需要很長時間,但對於每個工作表只會執行一次。
編輯:這個想法似乎運作良好。我已經實現了它,並可以掃描約130,000字上text file of nearly 600k一本字典在不到一秒鐘:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define die(...) exit((fprintf(stderr, "Fatal: " __VA_ARGS__), \
putc(10, stderr), 1))
typedef struct Sheet Sheet;
struct Sheet {
size_t size; /* Number of chars */
char *buf; /* Null-terminated char buffer */
char **ptr; /* Pointers into char buffer */
size_t nline; /* number of lines */
int *line; /* array of offset of line beginnings */
size_t naux; /* size of scratch array */
char **aux; /* scratch array */
};
/*
* Count occurrence of c in zero-terminated string p.
*/
size_t strcount(const char *p, int c)
{
size_t n = 0;
for (;;) {
p = strchr(p, c);
if (p == NULL) return n;
p++;
n++;
}
return 0;
}
/*
* String comparison via pointers to strings.
*/
int pstrcmp(const void *a, const void *b)
{
const char *const *aa = a;
const char *const *bb = b;
return strcmp(*aa, *bb);
}
/*
* Pointer comparison.
*/
int ptrcmp(const void *a, const void *b)
{
const char *const *aa = a;
const char *const *bb = b;
if (*aa == *bb) return 0;
return (*aa < *bb) ? -1 : 1;
}
/*
* Create and prepare a sheet, i.e. a text file to search.
*/
Sheet *sheet_new(const char *fn)
{
Sheet *sheet;
FILE *f = fopen(fn, "r");
size_t n;
int last;
char *p;
char **pp;
if (f == NULL) die("Couldn't open %s", fn);
sheet = malloc(sizeof(*sheet));
if (sheet == NULL) die("Allocation failed");
fseek(f, 0, SEEK_END);
sheet->size = ftell(f);
fseek(f, 0, SEEK_SET);
sheet->buf = malloc(sheet->size + 1);
sheet->ptr = malloc(sheet->size * sizeof(*sheet->ptr));
if (sheet->buf == NULL) die("Allocation failed");
if (sheet->ptr == NULL) die("Allocation failed");
fread(sheet->buf, 1, sheet->size, f);
sheet->buf[sheet->size] = '\0';
fclose(f);
sheet->nline = strcount(sheet->buf, '\n');
sheet->line = malloc(sheet->nline * sizeof(*sheet->line));
sheet->aux = NULL;
sheet->naux = 0;
n = 0;
last = 0;
p = sheet->buf;
pp = sheet->ptr;
while (*p) {
*pp++ = p;
if (*p == '\n') {
sheet->line[n++] = last;
last = p - sheet->buf + 1;
}
p++;
}
qsort(sheet->ptr, sheet->size, sizeof(*sheet->ptr), pstrcmp);
return sheet;
}
/*
* Clean up sheet.
*/
void sheet_delete(Sheet *sheet)
{
free(sheet->buf);
free(sheet->ptr);
free(sheet->line);
free(sheet->aux);
free(sheet);
}
/*
* Binary range search for string pointers.
*/
static char **pstr_bsearch(const char *key,
char **arr, size_t high)
{
size_t low = 0;
while (low < high) {
size_t mid = (low + high)/2;
int diff = strcmp(key, arr[mid]);
if (diff < 0) high = mid;
else low = mid + 1;
}
return arr + low;
}
/*
* Binary range search for line offsets.
*/
static const int *int_bsearch(int key, const int *arr, size_t high)
{
size_t low = 0;
while (low < high) {
size_t mid = (low + high)/2;
int diff = key - arr[mid];
if (diff < 0) high = mid;
else low = mid + 1;
}
if (low < 1) return NULL;
return arr + low - 1;
}
/*
* Find occurrences of the string key in the sheet. Returns the
* number of lines in which the key occurs and assigns up to
* max lines to the line array. (If max is 0, line may be NULL.)
*/
int sheet_find(Sheet *sheet, char *key,
int line[], int max)
{
char **begin, **end;
int n = 0;
size_t i, m;
size_t last;
begin = pstr_bsearch(key, sheet->ptr, sheet->size);
if (begin == NULL) return 0;
key[strlen(key) - 1]++;
end = pstr_bsearch(key, sheet->ptr, sheet->size);
key[strlen(key) - 1]--;
if (end == NULL) return 0;
if (end == begin) return 0;
m = end - begin;
if (m > sheet->naux) {
if (sheet->naux == 0) sheet->naux = 0x100;
while (sheet->naux < m) sheet->naux *= 2;
sheet->aux = realloc(sheet->aux, sheet->naux * sizeof(*sheet->aux));
if (sheet->aux == NULL) die("Re-allocation failed");
}
memcpy(sheet->aux, begin, m * sizeof(*begin));
qsort(sheet->aux, m, sizeof(*begin), ptrcmp);
last = 0;
for (i = 0; i < m; i++) {
int offset = sheet->aux[i] - sheet->buf;
const int *p;
p = int_bsearch(offset, sheet->line + last, sheet->nline - last);
if (p) {
if (n < max) line[n] = p - sheet->line;
last = p - sheet->line + 1;
n++;
}
}
return n;
}
/*
* Example client code
*/
int main(int argc, char **argv)
{
Sheet *sheet;
FILE *f;
if (argc != 3) die("Usage: %s patterns corpus", *argv);
sheet = sheet_new(argv[2]);
f = fopen(argv[1], "r");
if (f == NULL) die("Can't open %s.", argv[1]);
for (;;) {
char str[80];
int line[50];
int i, n;
if (fgets(str, sizeof(str), f) == NULL) break;
strtok(str, "\n");
n = sheet_find(sheet, str, line, 50);
printf("%8d %s\n", n, str);
if (n > 50) n = 50;
for (i = 0; i < n; i++) printf(" [%d] %d\n", i, line[i] + 1);
}
fclose(f);
sheet_delete(sheet);
return 0;
}
實施有粗糙的邊緣,但它的作品。我並不特別喜歡scratch數組和附加的排序,但是事實證明,即使排序大的後綴數組也不需要太長時間。
如果您願意,您可以將此解決方案擴展到更多工作表。
@DoomProg 30000個元素,包含一些200個字符的字符串 – Ulterior 2014-10-06 12:51:00
這是不完全清楚你在問什麼。你能舉個例子說明你在找什麼嗎? – 2014-10-06 12:55:04
@MOehm我編輯了包含示例的問題 – Ulterior 2014-10-06 12:57:01