散列大數據集和C實現

我有大量的值範圍從0 - 5463458053.對於每個值，我希望映射包含字符串的集合，以便操作查找，即我。即查找該集合中是否存在字符串佔用的時間最少。請注意，這組值可能不包含（0 - 5463458053）中的所有值，但是，其中包含大量值。散列大數據集和C實現

我當前的解決方案是散列這些值（0之間 - 5463458053）和每一個值，具有對應於該值的字符串的鏈接列表。每次，我想檢查一個給定集合中的字符串，我散列該值（介於0 - 5463458053之間），獲取鏈接列表，並遍歷它以確定它是否包含上述字符串。

雖然這看起來更簡單，但這有點費時。你能想到更快的解決方案嗎？另外，衝突將是可怕的。他們會導致錯誤的結果。

另一部分是關於在C中實現這個。我該如何去做這件事？

注意：使用數據庫，而不是有人建議。我想知道這是否有用。

我有點擔心RAM自然耗盡。 :-)

來源

2011-11-13 user866098

多少？它適合內存嗎？爲什麼字符串表示爲鏈接列表？一個字符串是否可以出現在多個列表中？一個列表中有多少個字符串？你已經有了某種形式的數據（內存，磁盤文件）嗎？ – wildplasser

你顯然沒有讀過我的問題。我說 - 「我目前的解決方案...」，並開始什麼是我的**想法，在那裏我把這組字符串表示爲一個鏈表。你可能有更好的解決方案？我都聽過這個。是的，一個字符串可能出現在多個列表中。 – user866098

你可以有散列表的散列表。第一個哈希表有你的整數鍵。其中的值是散列集，即其鍵爲字符串的散列表。

你也可以有一個哈希集，鍵是整數和字符串。

有許多庫實現這樣的數據結構（並且在C++中，標準庫正在實現它們，因爲std::map & std::set）。對於C，我想到了GTK的Glib。

使用哈希技術，內存使用與所考慮的集合（或關係）的大小成正比。例如，你可以接受30％的空虛率。

來源

2011-11-13 16:52:34

+1，注意在C++中'std :: map'和'std :: set'不是哈希表，而是二叉搜索樹。 –

啊，多哈希好點。另一個 - 鍵是整數和字符串 - 可能會碰撞更多，你不覺得嗎？我沒有使用C++，所以現在不是一個選項。我應該在帖子中提到這一點。我真的有點擔心RAM耗盡。 – user866098

是的，真正的散列事情可能是C++ 11中的'std :: unordered_map'和'std :: unordered_set'。 –

如果條目從0到N並且連續：使用數組。（正在索引速度不夠快嗎？）

編輯：數字似乎並不連續。有一個{key，value}對的大數，其中密鑰是一個大數字（> 32位，但是< 64位），並且該值是一串字符串。

如果內存可用，哈希表很容易，如果字符串的一堆不是太大，你可以依次對其進行檢查。如果相同的字符串不止一次出現，你可以枚舉這些字符串（把它們的指針放在一個char *數組[]中]，然後使用索引到該數組中，找到一個給定字符串的索引可能涉及另一個哈希表）

對於「大師」散列表中的條目很可能是：

struct entry { 
    struct entry *next; /* for overflow chain */ 
    unsigned long long key; /* the 33bits number */ 
    struct list *payload; 
    } entries[big_enough_for_all] ; /* if size is known in advance 
          , preallocation avoids a lot of malloc overhead */

，如果你有足夠的內存來存儲一個頭陣，你chould當然做到這一點：

struct entry *heads[SOME_SIZE] = {NULL, };

，否則您可以將磁頭陣列與條目數組組合。（像我一樣在這裏Lookups on known set of integer keys）

處理衝突易是：當你走溢出鏈，只是在進入關鍵的比較關鍵。如果他們不平等：繼續前進。如果他們是平等的：找到;現在去行走琴絃。

來源

2011-11-13 16:52:33 wildplasser

公平點，但一個大小爲5463458053的數組？ – user866098

具有5463458054元素的數組需要大量的RAM。如果每個元素是一個4字節的整數，那意味着超過5 * 4 = 20Gytes。 –

我沒有說它需要在記憶中完成。它可能是一組磁盤頁面。 – wildplasser

A Judy Array，實現它的C庫可能正是您需要的基礎。這裏的描述它報價：

朱迪是一個C庫，提供國家的最先進的核心技術實現稀疏動態數組。 Judy數組簡單地用空指針聲明爲。 Judy陣列僅在填充時才消耗內存，但如果需要，它可以增長以利用所有可用內存。 Judy的主要優勢是可擴展性，高性能和內存效率。 Judy數組是可擴展的，並且可以擴展到只有機器內存有限的非常大數量的元素。由於 Judy被設計爲一個無界陣列，因此Judy數組的大小爲未預先分配，但隨着數組總數動態增長和收縮。 Judy結合了可擴展性和易用性。 Judy API 通過簡單的插入，檢索和刪除不需要廣泛編程的調用來訪問。調諧和配置不需要（實際上甚至不可能）。此外，排序，搜索，計數和順序訪問功能都嵌入Judy的設計中。

只要開發人員需要動態調整大小的數組，可以使用Judy，關聯數組或關聯數組或簡單易用的界面，不需要返工以進行擴展或收縮。 Judy可以替換許多常見的數據結構，例如數組，稀疏數組，哈希表，B樹，二叉樹，線性列表，跳過列表，其他排序和搜索算法以及計數函數。

來源

2011-11-13 17:44:24

IIRC judy受限制版權保護。 – wildplasser

根據sourceforge頁面，你可以使用它作爲LGPL，這在大多數情況下是很好的。 –

我糾正了。我認爲朱迪仍然被惠普當作人質。但是，也許版權只適用於來源，而不是想法。 – wildplasser

大量的字符串+快速查找+有限的內存---->你想要一個前綴特里，暴擊位樹，或任何其他的家庭（很多不同的名字爲非常相似的東西，例如PATRICIA ..朱迪也是這樣的）。例如參見this。

這些數據結構允許前綴壓縮，所以它們能夠非常有效地存儲大量的字符串（這些字符串在某種程度上必然會有公共前綴）。另外，查找速度非常快。由於常見的big-O符號無法解釋的緩存和分頁效應，它們可能比散列更快或甚至更快，只佔內存的一小部分（儘管根據big-O，除了可能是數組可以擊敗哈希）。

來源

2011-11-13 18:21:15 Damon

您可以使用單個二叉查找樹（AVL/Red-black/...）來包含所有集合中的字符串全部，方法是按字典順序將它們鍵入爲（set_number，string）。你不需要在任何地方明確地存儲集合。例如，定義節點的順序爲樹比較可能看起來像：

function compare_nodes (node1, node2) { 
    if (node1.set_number < node2.set_number) return LESS; 
    if (node1.set_number > node2.set_number) return GREATER; 
    if (node1.string < node2.string) return LESS; 
    if (node1.string > node2.string) return GREATER; 
    return EQUAL; 
}

通過這樣的結構，一些常用的操作是可能的（但也許並不簡單）。

要查找集合set_number中是否存在字符串s，只需在樹中查找（set_number，s）即可完全匹配。

要查找一組set_number所有字符串：以上

function iterate_all_strings_in_set (set_number) { 
    // Traverse the tree from root downwards, looking for the given key. Return 
    // wherever the search ends up, whether it found the value or not. 
    node = lookup_tree_weak(set_number, ""); 

    // tree empty? 
    if (node == null) { 
     return; 
    } 

    // We may have gotten the greatest node from the previous set, 
    // instead of the first node from the set we're interested in. 
    if (node.set_number != set_number) { 
     node = successor(node); 
    } 

    while (node != null && node.set_number == set_number) { 
     do_something_with(node.string); 
     node = successor(node); 
    } 
}

需要O((k+1)*log(n))時間，其中k是set_number串的數量，並且n是所有字符串的數量。

要查找所有組數字與相關聯的至少一個字符串：

function iterate_all_sets() 
{ 
    node = first_node_in_tree(); 

    while (node != null) { 
     current_set = node.set_number; 
     do_something_with(current_set); 

     if (cannot increment current_set) { 
      return; 
     } 

     node = lookup_tree_weak(current_set + 1, ""); 
     if (node.set_number == current_set) { 
      node = successor(node); 
     } 
    } 
}

上述要求O((k+1)*log(n))時間，其中k是套有至少一個串號，並n是所有字符串的數量。

請注意，上面的代碼假定樹在「do_something」調用中未被修改;如果節點被移除，它可能會崩潰。

Addidionally，這是一些真正的C代碼，它演示了這一點，使用my own generic AVL tree implemetation。爲了編譯它，只需從BadVPN源文件夾中複製misc/和structure/文件夾並在其中添加包含路徑就足夠了。

請注意我的AVL樹在其節點中如何不包含任何「數據」，以及它如何不執行任何自己的內存分配。當您需要處理大量數據時，這非常方便。爲了說清楚：下面的程序只有一個malloc()，它是分配節點數組的節點。

#include <stdlib.h> 
#include <stdio.h> 
#include <inttypes.h> 
#include <assert.h> 

#include <structure/BAVL.h> 
#include <misc/offset.h> 

struct value { 
    uint32_t set_no; 
    char str[3]; 
}; 

struct node { 
    uint8_t is_used; 
    struct value val; 
    BAVLNode tree_node; 
}; 

BAVL tree; 

static int value_comparator (void *unused, void *vv1, void *vv2) 
{ 
    struct value *v1 = vv1; 
    struct value *v2 = vv2; 

    if (v1->set_no < v2->set_no) { 
     return -1; 
    } 
    if (v1->set_no > v2->set_no) { 
     return 1; 
    } 

    int c = strcmp(v1->str, v2->str); 
    if (c < 0) { 
     return -1; 
    } 
    if (c > 0) { 
     return 1; 
    } 
    return 0; 
} 

static void random_bytes (unsigned char *out, size_t n) 
{ 
    while (n > 0) { 
     *out = rand(); 
     out++; 
     n--; 
    } 
} 

static void random_value (struct value *out) 
{ 
    random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no)); 

    for (size_t i = 0; i < sizeof(out->str) - 1; i++) { 
     out->str[i] = (uint8_t)32 + (rand() % 94); 
    } 
    out->str[sizeof(out->str) - 1] = '\0'; 
} 

static struct node * find_node (const struct value *val) 
{ 
    // find AVL tree node with an equal value 
    BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val); 
    if (!tn) { 
     return NULL; 
    } 

    // get node pointer from pointer to its value (same as container_of() in Linux kernel) 
    struct node *n = UPPER_OBJECT(tn, struct node, tree_node); 
    assert(n->val.set_no == val->set_no); 
    assert(!strcmp(n->val.str, val->str)); 

    return n; 
} 

static struct node * lookup_weak (const struct value *v) 
{ 
    BAVLNode *tn = BAVL_Lookup(&tree, (void *)v); 
    if (!tn) { 
     return NULL; 
    } 

    return UPPER_OBJECT(tn, struct node, tree_node); 
} 

static struct node * first_node (void) 
{ 
    BAVLNode *tn = BAVL_GetFirst(&tree); 
    if (!tn) { 
     return NULL; 
    } 

    return UPPER_OBJECT(tn, struct node, tree_node); 
} 

static struct node * next_node (struct node *node) 
{ 
    BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node); 
    if (!tn) { 
     return NULL; 
    } 

    return UPPER_OBJECT(tn, struct node, tree_node); 
} 

size_t num_found; 

static void iterate_all_strings_in_set (uint32_t set_no) 
{ 
    struct value v; 
    v.set_no = set_no; 
    v.str[0] = '\0'; 
    struct node *n = lookup_weak(&v); 

    if (!n) { 
     return; 
    } 

    if (n->val.set_no != set_no) { 
     n = next_node(n); 
    } 

    while (n && n->val.set_no == set_no) { 
     num_found++; // "do_something_with_string" 
     n = next_node(n); 
    } 
} 

static void iterate_all_sets (void) 
{ 
    struct node *node = first_node(); 

    while (node) { 
     uint32_t current_set = node->val.set_no; 
     iterate_all_strings_in_set(current_set); // "do_something_with_set" 

     if (current_set == UINT32_MAX) { 
      return; 
     } 

     struct value v; 
     v.set_no = current_set + 1; 
     v.str[0] = '\0'; 
     node = lookup_weak(&v); 

     if (node->val.set_no == current_set) { 
      node = next_node(node); 
     } 
    } 
} 

int main (int argc, char *argv[]) 
{ 
    size_t num_nodes = 10000000; 

    // init AVL tree, using: 
    // key=(struct node).val, 
    // comparator=value_comparator 
    BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL); 

    printf("Allocating...\n"); 

    // allocate nodes (missing overflow check...) 
    struct node *nodes = malloc(num_nodes * sizeof(nodes[0])); 
    if (!nodes) { 
     printf("malloc failed!\n"); 
     return 1; 
    } 

    printf("Inserting %zu nodes...\n", num_nodes); 

    size_t num_inserted = 0; 

    // insert nodes, giving them random values 
    for (size_t i = 0; i < num_nodes; i++) { 
     struct node *n = &nodes[i]; 

     // choose random set number and string 
     random_value(&n->val); 

     // try inserting into AVL tree 
     if (!BAVL_Insert(&tree, &n->tree_node, NULL)) { 
      printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str); 
      n->is_used = 0; 
      continue; 
     } 

     n->is_used = 1; 
     num_inserted++; 
    } 

    printf("Looking up...\n"); 

    // lookup all those values 
    for (size_t i = 0; i < num_nodes; i++) { 
     struct node *n = &nodes[i]; 
     struct node *lookup_n = find_node(&n->val); 

     if (n->is_used) { // this node is the only one with this value 

      ASSERT(lookup_n == n) 
     } else { // this node was an insert collision; some other 
       // node must have this value 
      ASSERT(lookup_n != NULL) 
      ASSERT(lookup_n != n) 
     } 
    } 

    printf("Iterating by sets...\n"); 

    num_found = 0; 
    iterate_all_sets(); 
    ASSERT(num_found == num_inserted) 

    printf("Removing all strings...\n"); 

    for (size_t i = 0; i < num_nodes; i++) { 
     struct node *n = &nodes[i]; 
     if (!n->is_used) { // must not remove it it wasn't inserted 
      continue; 
     } 
     BAVL_Remove(&tree, &n->tree_node); 
    } 

    return 0; 
}

來源

2011-11-13 19:22:59

散列大數據集和C實現

回答

相關問題