在排序數據上創建索引

我有一個帶有一些排序數據的文本文件，用換行符分隔。例如：在排序數據上創建索引

... abc123 abc124 abd123 abd124 abd125 ...

現在我想創建一個數據集的索引，它應該（至少）支持：

getStringByIndex（N）：返回的第n項排序列表;
getIndexByString（s）：在所有項目中查找s，返回其索引（如果找不到則返回-1）;

我讀過一些索引算法，如哈希和B樹。具有兒童大小的額外領域的B型樹應該做得很好。但是由於日期排序，我想知道是否有比通過插入所有項目來構建B-Tree更有效的解決方案？

2013-04-04 richselian

我可以創建一個數據結構嗎？說一個具有數組和哈希集的數據結構。插入數組很容易，每次將其插入到數組中時，將該項插入哈希集。當你做getStringByIndex時，使用數組和getIndexByString，使用hashset？ – Calpis 2013-04-05 00:00:39

它更可能是「數據庫」而不是內存結構。索引應存儲在磁盤文件中。 – richselian 2013-04-05 00:16:15

數據集可能大到100GB，無法加載到內存中。否則簡單的二分查找可以解決這個問題。 – richselian 2013-04-05 00:26:55

由於數據是經過排序的，您只需在內存中保留一小段稀疏的數據子集，即可快速有效地定位內容。例如，假設我們決定將每個第N個元素存儲在內存中。爲了有效地初始化您的API，您希望將這個稀疏列表編譯到磁盤上的單獨文件中，因此您不必通過100GB數據流來獲取它。

對於這些術語中的每一個，都需要保存相對於文件開頭的磁盤偏移量。然後，所有你所要做的就是加載稀疏列表/偏移對到內存中，你的兩個請求的實現變得簡單明瞭：

getStringByIndex(n): 
     Get floor(n/N)-th string/offset pair from list 
     Seek offset position in index 
     Read/Skip n mod N strings, then return the next one 

    getIndexByString(s): 
     Binary search over sparse list in memory 
      Locate lower and upper bound string/offset pairs 
     If a string/offset pair is in the i-th position in our sparse list, 
      then the string itself is the (N x i)-th string in our index. 
      We can use this information to compute the return value 
     If the string we want isn't in memory: 
      Seek lower-bound offset in index 
      Read strings until we: 
       a) Find a match 
       b) Reach the high-bound offset 
       c) Reach a string which is lexicographically greater than the one we are looking for 
     Else 
      Just return the index for the matching string in our sparse list

如果您索引中的字符串是固定寬度，可以做出更大的優化。

如果你實現它，你會希望小心你對這個算法的'N'選擇。請記住，從磁盤位置讀取10個字節的成本並不比從相同位置讀取10,000個字節的成本低很多：這是磁盤搜尋的開銷，進出I/O調用傷害最大。

來源

2013-04-05 04:03:02 torquestomp

非常感謝您的提示。我會嘗試一個2級索引（稀疏子集的稀疏子集）。因爲對於100GB和N = 4KB，我必須將25MB索引數據加載到1級內存中，而2級只需要6.25KB。 – richselian 2013-04-06 05:10:25

在排序數據上創建索引

回答

相關問題