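PostgreSQL internals: composite column cardinality

This is not a question about space, but about indexing, and how index uniqueness affects the query plan.

In which of these scenarios is the cardinality of the index higher?

A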

Table: 
(
    Col1 smallint, 
    Col2 smallint 
)

where

Range Col1 : 0 - 1000 
Range Col2 : 0 - 1000

and a composite index on (Col1, Col2), always queried in that order.

B

Table:

(
    Col1_2 int 
)

where

Range Col1_2 : 0 - 1000^2

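and a single index on (Col1_2), which stores the combined parts of Col1 and Col2 and is what the queries run against.

What I'm basically asking is: is it better to combine (hash) multiple small numbers into one, or does it make no difference?

Source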

2013-07-17 IamIC

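The main differences between a composite index (an index on (a, b)) and an index on a hash function are:

With a composite index, PostgreSQL can make decisions based on the statistics it keeps for each individual column; and
With a composite index, you can efficiently query on just a. You cannot, however, efficiently query on just b.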
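The leading-column behaviour can be illustrated with a toy model. The following is only a sketch, assuming a composite index behaves like a list of (a, b) entries sorted by a and then b. The names `index`, `lookup_by_a` and `lookup_by_b` are hypothetical, and a real B-tree is considerably more involved.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for a composite index on (a, b): entries sorted by a, then b.
index = sorted((a, b) for a in range(1, 6) for b in range(1, 6))

def lookup_by_a(index, a):
    """Entries with a given leading value form one contiguous run,
    so two binary searches are enough to find them."""
    lo = bisect_left(index, (a, float("-inf")))
    hi = bisect_right(index, (a, float("inf")))
    return index[lo:hi]

def lookup_by_b(index, b):
    """Entries with a given trailing value are scattered through the
    list, so every entry has to be examined."""
    return [entry for entry in index if entry[1] == b]

rows_a = lookup_by_a(index, 3)  # one contiguous slice
rows_b = lookup_by_b(index, 3)  # full pass over the index
```

Both calls return five entries here, but only the lookup on a can skip the rest of the structure.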

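On the other hand, with an index on a::bigint << 32 + b, i.e. a single-element 64-bit index that combines the values of a and b, you can use the index only when you have both a and b. The same is true of an index on some_hash_function(a,b).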
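As a rough illustration of what "combining a and b into one 64-bit value" means, here is a minimal sketch. It assumes both values are non-negative and fit in 32 bits, and the helper names `pack` and `unpack` are hypothetical. The shift is parenthesised explicitly because `+` binds tighter than `<<`, in Python as in PostgreSQL, so the unparenthesised form shifts by 32 + b instead.

```python
MASK32 = (1 << 32) - 1

def pack(a, b):
    """Put a in the high 32 bits and b in the low 32 bits of one key."""
    return (a << 32) + b  # parentheses are required for this meaning

def unpack(key):
    """Recover (a, b) from a packed key."""
    return key >> 32, key & MASK32

key = pack(42, 3)
a, b = unpack(key)

# Without parentheses the expression means 42 << (32 + 3).
unparenthesised = 42 << 32 + 3
```

A packed key compares equal only when both halves match, which is why an index on such an expression cannot serve a lookup on a alone or b alone.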

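An index on a hash value can have one big advantage: it makes the index much smaller. The cost is reduced selectivity and the need to recheck the condition with something like: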

WHERE some_hash_function(a,b) = some_hash_function(42,3) AND (a = 42 AND b = 3)
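The recheck is needed because different (a, b) pairs can share a hash value. The sketch below models this with a deliberately tiny hash range so that collisions are guaranteed. This `some_hash_function` is a made-up stand-in, not anything PostgreSQL provides, and `hash_index` is just a dictionary of buckets.

```python
from collections import defaultdict

def some_hash_function(a, b, buckets=64):
    """Hypothetical lossy hash with a small range, so collisions occur."""
    return (a * 31 + b) % buckets

rows = [(a, b) for a in range(1, 101) for b in range(1, 101)]

# Build the "index": hash value -> rows that produced it.
hash_index = defaultdict(list)
for row in rows:
    hash_index[some_hash_function(*row)].append(row)

# Step 1: the index narrows the search to one bucket of candidates.
candidates = hash_index[some_hash_function(42, 3)]

# Step 2: recheck the real condition, since the bucket holds collisions.
matches = [row for row in candidates if row == (42, 3)]
```

The bucket contains many rows, but only one survives the recheck, which mirrors the extra `AND (a = 42 AND b = 3)` in the query above.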

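There is one big possibility you have neglected to consider, though: two separate indexes on a and b. PostgreSQL can combine them in a bitmap index scan or use them individually, whichever suits the query better. This is often the best option for two values that are loosely related and mostly uncorrelated.

Given the example: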

CREATE TABLE demoab(a integer, b integer); 

INSERT INTO demoab(a, b) 
SELECT a, b from generate_series(1,1000) a 
CROSS JOIN generate_series(1,1000) b; 

CREATE INDEX demoab_a ON demoab(a); 
CREATE INDEX demoab_b ON demoab(b); 
CREATE INDEX demoab_ab ON demoab(a,b); 
CREATE INDEX demoab_ab_shifted ON demoab ((a::bigint << 32 + b)); 
ANALYZE demoab; 

CREATE TABLE demob AS SELECT DISTINCT b FROM demoab ; 
CREATE TABLE demoa AS SELECT DISTINCT a FROM demoab ; 
ALTER TABLE demoa ADD PRIMARY KEY (a); 
ALTER TABLE demob ADD PRIMARY KEY (b);

Different query approaches:

regress=> explain analyze SELECT * FROM demoab WHERE a = 42 AND b = 3; 
                QUERY PLAN              
------------------------------------------------------------------------------------------------------------------ 
Index Scan using demoab_ab on demoab (cost=0.00..8.38 rows=1 width=8) (actual time=0.034..0.036 rows=1 loops=1) 
    Index Cond: ((a = 42) AND (b = 3)) 
Total runtime: 0.088 ms 
(3 rows) 


regress=> explain analyze SELECT * FROM demoab WHERE b = 3; 
                 QUERY PLAN              
----------------------------------------------------------------------------------------------------------------------- 
Bitmap Heap Scan on demoab (cost=19.85..2358.66 rows=967 width=8) (actual time=1.089..4.636 rows=1000 loops=1) 
    Recheck Cond: (b = 3) 
    -> Bitmap Index Scan on demoab_b (cost=0.00..19.61 rows=967 width=0) (actual time=0.661..0.661 rows=1000 loops=1) 
     Index Cond: (b = 3) 
Total runtime: 4.820 ms 
(5 rows) 

regress=> explain analyze SELECT * FROM demoab WHERE a = 42; 
                 QUERY PLAN              
----------------------------------------------------------------------------------------------------------------------- 
Index Scan using demoab_a on demoab (cost=0.00..37.19 rows=962 width=8) (actual time=0.155..0.751 rows=1000 loops=1) 
    Index Cond: (a = 42) 
Total runtime: 0.929 ms 
(3 rows) 

regress=> explain analyze SELECT * FROM demoab WHERE (a::bigint << 32 + b) = (42::bigint << 32 + 3); 
                 QUERY PLAN               
---------------------------------------------------------------------------------------------------------------------------- 
Bitmap Heap Scan on demoab (cost=4.69..157.67 rows=41 width=8) (actual time=0.260..0.495 rows=94 loops=1) 
    Recheck Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint) 
    -> Bitmap Index Scan on demoab_ab_shifted (cost=0.00..4.67 rows=41 width=0) (actual time=0.232..0.232 rows=94 loops=1) 
     Index Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint) 
Total runtime: 0.584 ms 
(5 rows)
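Note that the Recheck Cond in the last plan reads (a)::bigint << (32 + b): because `+` binds tighter than `<<`, the expression shifts a by 32 + b rather than packing b into the low bits, which is why the lookup returns 94 rows instead of 1. The sketch below reproduces that count. It ASSUMES that the shift count wraps modulo 64 and that bits shifted past bit 63 are discarded, which is typical of x86-64 builds but is an assumption here, not documented PostgreSQL behaviour. The function name `shifted_as_parsed` is hypothetical.

```python
MASK64 = (1 << 64) - 1

def shifted_as_parsed(a, b):
    """Model of a::bigint << (32 + b).
    Assumption: shift count taken modulo 64, result truncated to 64 bits."""
    return (a << ((32 + b) % 64)) & MASK64

# The constant shown in the plan: 42 << (32 + 3).
target = shifted_as_parsed(42, 3)

# Every (a, b) in the demo table whose expression value equals the target.
matches = [
    (a, b)
    for a in range(1, 1001)
    for b in range(1, 1001)
    if shifted_as_parsed(a, b) == target
]
```

Under these assumptions `target` is 1443109011456, the constant in the plan, and `matches` has 94 entries, the same as the reported row count. Writing the expression as ((a::bigint << 32) + b) would give each pair in this table its own value.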

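Here the composite index on (a,b) would be a clear win because it can use an index-only scan to fetch the tuple directly. In real-world use that is unlikely, because an index-only scan cannot return values from non-indexed columns. For that reason I ran SET enable_indexonlyscan = off for these tests.

Rather surprisingly, the index sizes are the same: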

regress=> SELECT 
pg_relation_size('demoab_ab_shifted') AS shifted, 
pg_relation_size('demoab_ab') AS ab, 
pg_relation_size('demoab_a') AS a, 
pg_relation_size('demoab_b') AS b; 
 shifted  |    ab    |    a     |    b
----------+----------+----------+----------
 22487040 | 22487040 | 22487040 | 22487040
(1 row)

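I expected the single-value indexes to need much less space. Alignment requirements explain some of it, but it is still an unexpected result to me.

For the join case that @wildplasser asked about: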

regress=> EXPLAIN ANALYZE 
SELECT demoa.a, demob.b 
FROM demoab 
INNER JOIN demoa ON (demoa.a = demoab.a) 
INNER JOIN demob ON (demob.b = demoab.b) 
WHERE demoa.a = 100 AND demob.b = 500; 

                  QUERY PLAN               
------------------------------------------------------------------------------------------------------------------------------ 
Nested Loop (cost=0.00..24.94 rows=1 width=8) (actual time=0.121..0.126 rows=1 loops=1) 
    -> Nested Loop (cost=0.00..16.66 rows=1 width=8) (actual time=0.089..0.092 rows=1 loops=1) 
     -> Index Scan using demoab_ab on demoab (cost=0.00..8.38 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=1) 
       Index Cond: ((a = 100) AND (b = 500)) 
     -> Index Scan using demoa_pkey on demoa (cost=0.00..8.27 rows=1 width=4) (actual time=0.062..0.062 rows=1 loops=1) 
       Index Cond: (a = 100) 
    -> Index Scan using demob_pkey on demob (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.031 rows=1 loops=1) 
     Index Cond: (b = 500) 
Total runtime: 0.203 ms 
(9 rows)

This shows that in this case PostgreSQL still prefers the composite index on (a,b). That would not be the case if you were joining only on b, though:

regress=> EXPLAIN ANALYZE 
SELECT demoab.a, demoab.b 
FROM demoab 
INNER JOIN demob ON (demob.b = demoab.b) 
WHERE demob.b = 500; 
                 QUERY PLAN               
----------------------------------------------------------------------------------------------------------------------------- 
Nested Loop (cost=19.85..2376.59 rows=967 width=8) (actual time=0.935..3.653 rows=1000 loops=1) 
    -> Index Scan using demob_pkey on demob (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.032 rows=1 loops=1) 
     Index Cond: (b = 500) 
    -> Bitmap Heap Scan on demoab (cost=19.85..2358.66 rows=967 width=8) (actual time=0.897..3.123 rows=1000 loops=1) 
     Recheck Cond: (b = 500) 
     -> Bitmap Index Scan on demoab_b (cost=0.00..19.61 rows=967 width=0) (actual time=0.436..0.436 rows=1000 loops=1) 
       Index Cond: (b = 500) 
Total runtime: 3.834 ms 
(8 rows)

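You will notice that an index on a hash function of any kind would not be much use here. For that reason I recommend a composite index on (a,b), plus a secondary index on (b) if you need it.

In terms of uniqueness, you will find it informative to look at pg_catalog.pg_stats. There you will see that PostgreSQL does not maintain statistics for plain column indexes, only for the heap columns being indexed. An expression index such as demoab_ab_shifted is the exception and gets its own entry. In this case: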

regress=> select tablename, attname, n_distinct, correlation 
from pg_stats where tablename like 'demo%'; 
     tablename     | attname | n_distinct | correlation
-------------------+---------+------------+-------------
 demoab            | a       |       1000 |           1
 demoab            | b       |       1000 |   0.0105023
 demoab_ab_shifted | expr    |      21593 |   0.0175595
 demob             | b       |         -1 |    0.021045
 demoa             | a       |         -1 |    0.021045
(5 rows)

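It does not look like PostgreSQL sees any significant difference between the hashed/combined approach and two discrete, independent values.

Source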

2013-07-17 08:19:02

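I will always know both a and b, and will never query on b alone. So I want to know which pattern is most likely to force an index scan, and which would result in a faster scan. – IamIC

Could you point me to some documentation on bitmap index scans? Although that does not suit my particular query, I would love to see how the mechanism works. – IamIC

If you intend to retrieve a single {a,b} tuple and know both a and b exactly, a _faster scan_ is moot. The outcome will differ when the query can yield more than one result row. So how about a join with two other tables, one on a and the other on b? – wildplasser

If Col1 and Col2 are independent fields, why combine them? It will not save any space. Just stick to the database principle of atomicity.

Source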

2013-07-17 07:55:03 lulyon

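I'm not asking about space. I'm asking about index uniqueness as it affects the query execution plan. – IamIC

@lanC A single index is simple, and therefore helps query optimization. But PostgreSQL is better than other databases at supporting multidimensional indexes. I think it depends on how Col1 and Col2 are related. If Col1 and Col2 are highly correlated, so that combining them means something, for example if Col1 and Col2 are the (x, y) coordinates of a point, then I recommend using a multidimensional index. If the combination of Col1 and Col2 doesn't mean anything, then there is no need for non-atomicity just for the sake of a simple single index. – lulyon

Thanks. Your comment has a double negative, so I don't understand the comparison you are making. Let's say Col1 and Col2 are loosely related (price, weight): are you saying the only benefit of combining them is query simplicity? I would also think that looking up 1 value is faster than looking up 2. – IamIC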
