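The main difference between a composite index (an index on (a, b)) and an index on a hash function is which columns you need to know in order to use it. A composite index can also serve queries that constrain only its leading column, a.
An index on a::bigint << 32 + b, on the other hand, i.e. a single-element index on a 64-bit value combining the values of a and b, can only be used when you have both a and b. The same is true of an index on some_hash_function(a,b).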
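An index on a hash value can be a big advantage, since it makes the index a lot smaller, at the cost of reduced selectivity and the need to recheck the condition using something like: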
WHERE some_hash_function(a,b) = some_hash_function(42,3) AND (a = 42 AND b = 3)
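As a minimal sketch of the hash-and-recheck pattern: the table hash_demo and the function demo_hash below are hypothetical stand-ins for your own table and for some_hash_function. hashtext is an internal, undocumented PostgreSQL function, used here only for illustration, so treat all three as assumptions rather than a recommended implementation.

```sql
-- Hypothetical scratch table, for illustration only.
CREATE TABLE hash_demo (a integer, b integer);
INSERT INTO hash_demo (a, b)
SELECT a, b FROM generate_series(1, 100) a
CROSS JOIN generate_series(1, 100) b;

-- Hypothetical stand-in for some_hash_function(a, b).
-- It has to be IMMUTABLE to be usable in an index expression.
CREATE FUNCTION demo_hash(integer, integer) RETURNS integer
LANGUAGE sql IMMUTABLE AS
$$ SELECT hashtext($1::text || ',' || $2::text) $$;

-- Index the hash value instead of the two columns.
CREATE INDEX hash_demo_hashed ON hash_demo (demo_hash(a, b));
ANALYZE hash_demo;

-- The hash comparison is what lets the planner consider the index;
-- the plain comparisons discard rows whose hashes merely collide.
SELECT *
FROM hash_demo
WHERE demo_hash(a, b) = demo_hash(42, 3)
  AND a = 42 AND b = 3;
```

Whether this pays off depends on the hash being meaningfully narrower than the columns it replaces, which is worth measuring rather than assuming.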
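There's a big possibility you've neglected to consider, though: two separate indexes on a and b. PostgreSQL can combine them in a bitmap index scan or use them individually, whichever is more appropriate for the query. This is often the best choice for two loosely related and mostly uncorrelated values.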
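A small sketch of the two-index approach, using a hypothetical scratch table bitmap_demo so that it runs on its own. The planner is free to pick a different plan, so the comments describe what one would typically expect rather than guaranteed output.

```sql
-- Hypothetical scratch table with two independent single-column indexes.
CREATE TEMP TABLE bitmap_demo AS
SELECT a, b FROM generate_series(1, 300) a
CROSS JOIN generate_series(1, 300) b;
CREATE INDEX bitmap_demo_a ON bitmap_demo (a);
CREATE INDEX bitmap_demo_b ON bitmap_demo (b);
ANALYZE bitmap_demo;

-- Either condition may match: a likely candidate for a BitmapOr
-- over both indexes, followed by one Bitmap Heap Scan.
EXPLAIN SELECT * FROM bitmap_demo WHERE a = 42 OR b = 3;

-- Both conditions must match: the planner may AND the two bitmaps,
-- or decide that a single index plus a filter is cheaper.
EXPLAIN SELECT * FROM bitmap_demo WHERE a = 42 AND b = 3;

-- Only one column known: the matching index can be used by itself.
EXPLAIN SELECT * FROM bitmap_demo WHERE b = 3;
```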
Given the example:
CREATE TABLE demoab(a integer, b integer);
INSERT INTO demoab(a, b)
SELECT a, b from generate_series(1,1000) a
CROSS JOIN generate_series(1,1000) b;
CREATE INDEX demoab_a ON demoab(a);
CREATE INDEX demoab_b ON demoab(b);
CREATE INDEX demoab_ab ON demoab(a,b);
CREATE INDEX demoab_ab_shifted ON demoab ((a::bigint << 32 + b));
ANALYZE demoab;
CREATE TABLE demob AS SELECT DISTINCT b FROM demoab ;
CREATE TABLE demoa AS SELECT DISTINCT a FROM demoab ;
ALTER TABLE demoa ADD PRIMARY KEY (a);
ALTER TABLE demob ADD PRIMARY KEY (b);
Different query approaches:
regress=> explain analyze SELECT * FROM demoab WHERE a = 42 AND b = 3;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using demoab_ab on demoab (cost=0.00..8.38 rows=1 width=8) (actual time=0.034..0.036 rows=1 loops=1)
Index Cond: ((a = 42) AND (b = 3))
Total runtime: 0.088 ms
(3 rows)
regress=> explain analyze SELECT * FROM demoab WHERE b = 3;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on demoab (cost=19.85..2358.66 rows=967 width=8) (actual time=1.089..4.636 rows=1000 loops=1)
Recheck Cond: (b = 3)
-> Bitmap Index Scan on demoab_b (cost=0.00..19.61 rows=967 width=0) (actual time=0.661..0.661 rows=1000 loops=1)
Index Cond: (b = 3)
Total runtime: 4.820 ms
(5 rows)
regress=> explain analyze SELECT * FROM demoab WHERE a = 42;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using demoab_a on demoab (cost=0.00..37.19 rows=962 width=8) (actual time=0.155..0.751 rows=1000 loops=1)
Index Cond: (a = 42)
Total runtime: 0.929 ms
(3 rows)
regress=> explain analyze SELECT * FROM demoab WHERE (a::bigint << 32 + b) = (42::bigint << 32 + 3);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on demoab (cost=4.69..157.67 rows=41 width=8) (actual time=0.260..0.495 rows=94 loops=1)
Recheck Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint)
-> Bitmap Index Scan on demoab_ab_shifted (cost=0.00..4.67 rows=41 width=0) (actual time=0.232..0.232 rows=94 loops=1)
Index Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint)
Total runtime: 0.584 ms
(5 rows)
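One detail in the last plan is worth checking: the index condition is printed as ((a)::bigint << (32 + b)), and the scan returns 94 rows rather than 1. In PostgreSQL, + binds more tightly than <<, so the expression as written shifts a by 32 + b bits instead of packing a into the high 32 bits and b into the low 32 bits. The sketch below shows the difference. The index name demoab_ab_packed is a hypothetical addition, not part of the original test, and the packing assumes b is never negative.

```sql
-- As written: parsed as 42::bigint << (32 + 3).
SELECT 42::bigint << 32 + 3 AS as_written;     -- 1443109011456, the constant shown in the plan

-- With explicit parentheses: a in the high 32 bits, b in the low 32 bits.
SELECT (42::bigint << 32) + 3 AS as_intended;  -- 180388626435

-- Hypothetical corrected index and a query written to match it.
CREATE INDEX demoab_ab_packed ON demoab (((a::bigint << 32) + b));
EXPLAIN SELECT * FROM demoab
WHERE ((a::bigint << 32) + b) = ((42::bigint << 32) + 3);
```

The query has to repeat the indexed expression exactly, parentheses included, for the planner to match it against the index.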
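In these tests the composite index on (a,b) would be a clear win, because it can use an index-only scan to fetch the tuples directly. In a realistic situation where you also need values from non-indexed columns, that would not be possible. Because of that, I ran SET enable_indexonlyscan = off for these tests.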
Rather surprisingly, the index sizes are the same:
regress=> SELECT
pg_relation_size('demoab_ab_shifted') AS shifted,
pg_relation_size('demoab_ab') AS ab,
pg_relation_size('demoab_a') AS a,
pg_relation_size('demoab_b') AS b;
shifted | ab | a | b
----------+----------+----------+----------
22487040 | 22487040 | 22487040 | 22487040
(1 row)
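I'd expected the single-value indexes to require considerably less space. Alignment requirements explain some of it, but it's still an unexpected result to me.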
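To compare sizes without typing each index name by hand, a catalog query along these lines can list every index on the table. This is only a sketch and assumes the demoab table from the example above. As far as I understand it, on common 64-bit builds each B-tree entry is padded to an 8-byte boundary, so a 4-byte key and an 8-byte key can end up occupying the same space per entry, which would be consistent with the identical sizes.

```sql
-- List every index on demoab with its on-disk size.
SELECT c.relname AS index_name,
       pg_relation_size(c.oid) AS bytes,
       pg_size_pretty(pg_relation_size(c.oid)) AS pretty
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'demoab'::regclass
ORDER BY c.relname;
```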
In the case of the joins @wildplasser asked about:
regress=> EXPLAIN ANALYZE
SELECT demoa.a, demob.b
FROM demoab
INNER JOIN demoa ON (demoa.a = demoab.a)
INNER JOIN demob ON (demob.b = demoab.b)
WHERE demoa.a = 100 AND demob.b = 500;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..24.94 rows=1 width=8) (actual time=0.121..0.126 rows=1 loops=1)
-> Nested Loop (cost=0.00..16.66 rows=1 width=8) (actual time=0.089..0.092 rows=1 loops=1)
-> Index Scan using demoab_ab on demoab (cost=0.00..8.38 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=1)
Index Cond: ((a = 100) AND (b = 500))
-> Index Scan using demoa_pkey on demoa (cost=0.00..8.27 rows=1 width=4) (actual time=0.062..0.062 rows=1 loops=1)
Index Cond: (a = 100)
-> Index Scan using demob_pkey on demob (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.031 rows=1 loops=1)
Index Cond: (b = 500)
Total runtime: 0.203 ms
(9 rows)
shows that in this case PostgreSQL prefers the composite index on (a,b). That wouldn't be the case if you were only joining on b, though:
regress=> EXPLAIN ANALYZE
SELECT demoab.a, demoab.b
FROM demoab
INNER JOIN demob ON (demob.b = demoab.b)
WHERE demob.b = 500;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=19.85..2376.59 rows=967 width=8) (actual time=0.935..3.653 rows=1000 loops=1)
-> Index Scan using demob_pkey on demob (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.032 rows=1 loops=1)
Index Cond: (b = 500)
-> Bitmap Heap Scan on demoab (cost=19.85..2358.66 rows=967 width=8) (actual time=0.897..3.123 rows=1000 loops=1)
Recheck Cond: (b = 500)
-> Bitmap Index Scan on demoab_b (cost=0.00..19.61 rows=967 width=0) (actual time=0.436..0.436 rows=1000 loops=1)
Index Cond: (b = 500)
Total runtime: 3.834 ms
(8 rows)
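You'll note that a hash index of any kind wouldn't have been useful here. For that reason I'd recommend going for the composite index on (a,b), plus a secondary index on (b) if you need it.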
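One way to try that recommendation against the demo table without permanently changing it is to drop the other indexes inside a transaction and roll back afterwards. This is only a sketch: it assumes the indexes from the example above still exist, and the plans you get may differ.

```sql
BEGIN;

-- Keep only the composite index demoab_ab and the secondary index demoab_b.
DROP INDEX demoab_a;
DROP INDEX demoab_ab_shifted;

-- a is the leading column of demoab_ab, so the composite index can serve this.
EXPLAIN SELECT * FROM demoab WHERE a = 42;

-- b is not the leading column, so this is where demoab_b earns its keep.
EXPLAIN SELECT * FROM demoab WHERE b = 3;

-- Both columns known: the composite index is the natural match.
EXPLAIN SELECT * FROM demoab WHERE a = 42 AND b = 3;

-- Undo the drops; DROP INDEX is transactional in PostgreSQL.
ROLLBACK;
```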
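In terms of uniqueness, you'll find it informative to look at pg_catalog.pg_stats. There you'll see that PostgreSQL doesn't maintain statistics on individual indexes, only on the heap columns that are indexed (plus, as the expr row below shows, on the expression of an expression index). In this case: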
regress=> select tablename, attname, n_distinct, correlation
from pg_stats where tablename like 'demo%';
tablename | attname | n_distinct | correlation
-------------------+---------+------------+-------------
demoab | a | 1000 | 1
demoab | b | 1000 | 0.0105023
demoab_ab_shifted | expr | 21593 | 0.0175595
demob | b | -1 | 0.021045
demoa | a | -1 | 0.021045
(5 rows)
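It doesn't look like PostgreSQL will see any significant difference between the hashed/combined approach and the two separate, independent values.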
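I will always know a and b, and will never query on b alone. So I want to know which approach is most likely to force an index scan, and which will result in a faster scan. – IamIC
Can you point me to some documentation on bitmap index scans? It isn't applicable to my specific query, but I'd love to see how the mechanism works. – IamIC
If you intend to retrieve one {a,b} tuple, and both a and b are fully known, looking for a _faster scan_ is futile. But things will differ when the query could yield more than one result row. So how about joining two other tables, one of them on a and the other on b? – wildplasser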