Checking the uniqueness of a table field in Postgres

In some situations (large datasets stored in a table), I need to check the uniqueness of a field of a Postgres table.

For simplicity, let's say I have the following table:

id | name 
-------------- 
1 | david 
2 | catrine 
3 | hmida

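I want to check the uniqueness of the name field; for this table the result would be true. So far I have managed with code like the following: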

select name, count(*) 
from test 
group by name 
having count(*) > 1

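Keep in mind that I have a large dataset, so I would prefer this to be handled by the RDBMS rather than fetching the data through an adapter (e.g. psycopg2). So I need to optimize it as much as possible. Any geeky ideas?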
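Since the desired result is a single true/false value, the check above can also be phrased as a boolean. The following is only a sketch, assuming the table `test` and column `name` from the example; the alias `name_is_unique` is made up for illustration, and neither variant is guaranteed to be faster than the original query, so it is worth comparing them with EXPLAIN ANALYZE on the real data.

```sql
-- Table and column names come from the question's example: test(name).
-- The alias name_is_unique is hypothetical.

-- Variant 1: true when no name occurs more than once.
-- Note: GROUP BY puts all NULLs into one group, so two NULL names
-- count as a duplicate here.
select not exists (
    select 1
    from test
    group by name
    having count(*) > 1
) as name_is_unique;

-- Variant 2: compare the number of non-null names with the number
-- of distinct names. Both aggregates ignore NULLs.
select count(name) = count(distinct name) as name_is_unique
from test;
```

The two variants differ only in how they treat NULLs, so the choice depends on whether repeated NULLs should count as duplicates.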

Source

2017-10-10 rachid el kedmiri

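Doesn't your code work (it looks like it should)? What is the problem?

The query takes 2 minutes on a 10-million-row dataset; I need it to be faster.

With your query, the data is processed on the database side, not in psycopg2.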

This would be a possibly faster, but less reliable, solution:

t=# create table t (i int); 
CREATE TABLE 
t=# insert into t select generate_series(1,9,1); 
INSERT 0 9 
t=# insert into t select generate_series(1,999999,1); 
INSERT 0 999999 
t=# insert into t select generate_series(1,9999999,1); 
INSERT 0 9999999

Now the query:

t=# select i,count(*) from t group by i having count(*) > 1 order by 2 desc,1 limit 1; 
i | count 
---+------- 
1 |  3 
(1 row) 

Time: 7538.476 ms

Now checking from the statistics:

t=# analyze t; 
ANALYZE 
Time: 1079.465 ms 
t=# with fr as (select most_common_vals::text::text[] from pg_stats where tablename = 't' and attname='i') 
select count(1),i from t join fr on true where i::text = any(most_common_vals) group by i; 
count | i 
-------+-------- 
    2 | 94933 
    2 | 196651 
    2 | 242894 
    2 | 313829 
    2 | 501027 
    2 | 757714 
    2 | 778442 
    2 | 896602 
    2 | 929918 
    2 | 979650 
    2 | 999259 
(11 rows) 

Time: 3584.582 ms

And lastly, after statistics have been gathered on the table, just check whether the single most frequent value occurs more than once:

t=# select count(1),i from t where i::text = (select (most_common_vals::text::text[])[1] from pg_stats where tablename = 't' and attname='i') group by i; 
count | i 
-------+------ 
    2 | 1540 
(1 row) 

Time: 1871.907 ms

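Update

pg_stats is not refreshed at the moment the data is modified; it reflects the last ANALYZE. So there is a chance that the aggregated statistics on data distribution are out of date. In my sample: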

t=# delete from t where i = 1540; 
DELETE 2 
Time: 941.684 ms 
t=# select count(1),i from t where i::text = (select (most_common_vals::text::text[])[1] from pg_stats where tablename = 't' and attname='i') group by i; 
count | i 
-------+--- 
(0 rows) 

Time: 1876.136 ms 
t=# analyze t; 
ANALYZE 
Time: 77.108 ms 
t=# select count(1),i from t where i::text = (select (most_common_vals::text::text[])[1] from pg_stats where tablename = 't' and attname='i') group by i; 
count | i 
-------+------- 
    2 | 41377 
(1 row) 

Time: 1878.260 ms
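Because the result depends on when the statistics were last gathered, it can help to look at that before trusting the check. A minimal sketch, assuming the example table `t` and a PostgreSQL version (9.4 or later) where `pg_stat_user_tables` exposes `n_mod_since_analyze`:

```sql
-- When were statistics last gathered for table t, and roughly how many
-- rows have been modified since then?
-- last_analyze is set by a manual ANALYZE, last_autoanalyze by autovacuum.
select relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
from pg_stat_user_tables
where relname = 't';
```

A large `n_mod_since_analyze`, or both timestamps being NULL, suggests running `analyze t;` first.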

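Of course, if you rely on more than just the single most frequent value, the chance of a miss decreases, but again, this approach depends on the statistics being fresh.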
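In the same spirit, the statistics also carry an estimate of the number of distinct values. This is not part of the answer above, only a related sketch: in `pg_stats`, an `n_distinct` of -1 means the planner estimates the column to be unique. It is derived from a sample, so it is a hint rather than a proof, and it is subject to the same freshness caveat.

```sql
-- Assumes ANALYZE has been run recently on table t (names from the example).
-- If several schemas contain a table named t, add a schemaname filter.
-- The alias estimated_unique is hypothetical.
select n_distinct,
       n_distinct = -1 as estimated_unique
from pg_stats
where tablename = 't'
  and attname = 'i';
```

If `estimated_unique` is false, duplicates very likely exist; if it is true, an exact query is still needed to confirm it.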

Source

2017-10-10 09:30:04

Could you elaborate on why you describe your solution as "not reliable", and in which cases it would be unreliable? Many thanks.
