2011-03-08 113 views
1

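I am trying to find the number of users created in the three months leading up to the most recently created user, all grouped by state. Tags: performance, aggregate, where clause

Here is a query that works: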

select count(u.id) as numberOfUsers, 
s.state 
from users u 
join states s on u.state_id = s.id 
where u.creationdate > (
select max(u2.creationdate) 
from users u2 
where u2.state_id = s.id 
) - interval '3 months' 
group by s.state 

However, it takes 100 seconds. Can someone give me a more performant one?

I was hoping this would work:

select count(u.id) as numberOfUsers, 
s.state, max(u2.creationdate) as lastCreated 
from users u 
join states s on u.state_id = s.id 
where u.creationdate > lastCreated - interval '3 months' 
group by s.state 

Answers

3

This may perform better because it does only a single scan:

select count(*) as numberofusers, 
     state 
from (select id, state_id, creationdate, 
       max(creationdate) over (partition by state_id) - '3 months'::interval as cutoff 
     from users 
    ) x 
    join states on states.id = x.state_id 
where creationdate > cutoff 
group by state 

However, it will chew through a lot of working memory to do the initial window aggregate.

Hmm, maybe something more like:

with cutoffs as (
    select id, state, 
     (select max(creationdate) 
      from users 
      where users.state_id = states.id) - '3 months'::interval as cutoff 
    from states) 
select count(*) as numberofusers, state 
from users 
    join cutoffs on users.state_id = cutoffs.id 
where users.creationdate > cutoff 
group by state 

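This tries to tease PostgreSQL into doing a sensible partitioned scan, but it isn't really ideal. It still does a full table scan, but at least only one of them. A set-returning function that iterates through the output of the CTE and issues the outer query inside the loop would probably work best, because it could use an index on creationdate for each state.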
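The set-returning function idea above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the function name `recent_users_per_state` and its output column names are hypothetical, it assumes `states.state` is text-like and `users.creationdate` is a date or timestamp, and it assumes PostgreSQL 8.4 or later (for `returns table`) with an index on `users (state_id, creationdate)`.

```sql
-- Hypothetical helper: one small, index-friendly count per state.
create or replace function recent_users_per_state()
returns table (state_name text, user_count bigint)
language plpgsql
stable
as $$
declare
    st record;
begin
    -- Compute each state's cutoff first (newest user minus three months).
    for st in
        select s.id,
               s.state,
               (select max(u.creationdate)
                  from users u
                 where u.state_id = s.id) - interval '3 months' as cutoff
          from states s
    loop
        state_name := st.state;

        -- Issue the per-state count inside the loop so the planner can
        -- use the (state_id, creationdate) index for each state.
        select count(*)
          into user_count
          from users u
         where u.state_id = st.id
           and u.creationdate > st.cutoff;

        return next;
    end loop;
end;
$$;

select * from recent_users_per_state();
```

A state with no users gets a NULL cutoff and therefore a count of 0 in this sketch, whereas the join-based queries above would omit it.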

+0

Awesome! I modified your query a bit and got 82 ms. (vs. 100,000 ms, that's only a few orders of magnitude) – 2011-03-08 22:53:59

0

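Are you sure which part of the query is slow? Can you add indexes? I'm no Postgres guru, but I suspect that if users isn't indexed on users.creationdate, the MAX() function will have to do a full table scan. Hmm, it might have to do one anyway...

That said, here goes nothing!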

SELECT u.numUsers, s.state FROM 
(SELECT count(id) as numUsers, state_id 
FROM users 
WHERE creationdate > (MAX(creationdate) - interval '3 Months' 
GROUP BY state_id) u 
left join states s on u.state_id = s.state_id 
+0

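The problem is that it is doing a full-table aggregate over all users in a state for each user in that state. Also, this query doesn't actually work, because you can't use an aggregate in a where clause. – 2011-03-08 23:17:15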
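To illustrate the point about aggregates in a where clause: one way to keep the shape of the query in this answer is to compute the per-state maximum in its own derived table and join against it. This is a sketch only. The alias names (`usr`, `c`, `cutoff`) are made up here, and it assumes the join key is `states.id`, as in the question, rather than `s.state_id`.

```sql
select u.numusers, s.state
from (select count(*) as numusers, usr.state_id
        from users usr
        -- The aggregate lives in a grouped subquery, not in WHERE.
        join (select state_id,
                     max(creationdate) - interval '3 months' as cutoff
                from users
               group by state_id) c
          on c.state_id = usr.state_id
       where usr.creationdate > c.cutoff
       group by usr.state_id) u
join states s on s.id = u.state_id;
```

This still reads the users table twice, so it addresses correctness rather than the performance question.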

2

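Out of interest, how does the following query perform? I'm particularly interested in how PostgreSQL handles the innermost query (states table + scalar subquery).

A composite index on users (state_id, creationdate) is required for this to work well.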

select s2.id 
     ,s2.state 
     ,(select count(*) 
      from users u 
     where u.state_id  = s2.id 
      and u.creationdate > s2.max_date) as numberOfUsers 
    from (select s.id 
       ,s.state 
       ,(select max(u.creationdate) - interval '3 months' 
        from users u 
       where u.state_id = s.id) as max_date 
     from states s 
     ) s2; 
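The composite index this answer relies on could be created roughly like this. The index name is the one that appears in the plan output in this thread; running `analyze` afterwards is an added suggestion so the planner has fresh statistics, not something the answer states.

```sql
-- Composite index: equality on state_id first, then range on creationdate.
create index users_state_id_creationdate_idx
    on users (state_id, creationdate);

-- Refresh planner statistics for the table (assumed helpful, not required).
analyze users;
```

With state_id as the leading column, both the per-state max(creationdate) lookup and the per-state range count can be answered from the same index.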

Edit: this is the plan the query produced for 100,000 user rows across 3 states:

Seq Scan on states s (actual time=4.033..13.949 rows=3 loops=1) 
    Buffers: shared hit=1743 
    SubPlan 3 
    -> Aggregate (actual time=4.636..4.636 rows=1 loops=3) 
      Buffers: shared hit=1742 
      InitPlan 2 (returns $2) 
      -> Result (actual time=0.028..0.028 rows=1 loops=3) 
        Buffers: shared hit=12 
        InitPlan 1 (returns $1) 
        -> Limit (actual time=0.022..0.022 rows=1 loops=3) 
          Buffers: shared hit=12 
          -> Index Scan Backward using users_state_id_creationdate_idx on users u (actual time=0.019..0.019 rows=1 loops=3) 
           Index Cond: ((state_id = $0) AND (creationdate IS NOT NULL)) 
           Buffers: shared hit=12 
      -> Bitmap Heap Scan on users u (actual time=1.095..3.693 rows=8425 loops=3) 
       Recheck Cond: ((state_id = $0) AND (creationdate > $2)) 
       Buffers: shared hit=1730 
       -> Bitmap Index Scan on users_state_id_creationdate_idx (actual time=1.017..1.017 rows=8425 loops=3) 
         Index Cond: ((state_id = $0) AND (creationdate > $2)) 
         Buffers: shared hit=107 
Total runtime: 14.017 ms 
+1

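It performs very well (admittedly, my data is just random white noise). With 100,000 users it averages about 17 ms, whereas my solution averages about 180 ms (and the OP's original, which I didn't have the patience to wait for). I'll add the plan to your answer, since it would be unreadable otherwise. – araqnid 2011-03-09 13:16:04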

+0

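@araqnid, awesome! Thank you very much for taking the time! I must say PostgreSQL is slowly becoming the best database I've never used ;) I'll have to find a real project to use it on soon. – Ronnis 2011-03-09 13:57:41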

+0

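FYI: on our database, the query above took about 3 seconds, versus about 80 ms for the query I posted and 100 seconds for the original query. (Our table structure is slightly more complex and lacks the indexes needed to optimize this query; I simplified the query a bit here.) – 2011-03-10 04:29:03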

1

Here is the query I used to get the time down to 82 ms:

with cutoffs as (
    -- per-state cutoff: three months before that state's newest user
    select max(u.creationdate) - interval '3 months' as cutoff, s.id, s.state
      from users u 
    join states s on u.state_id = s.id 
group by s.state, s.id) 
select count(*) as numberofusers, state 
from users 
    join cutoffs on users.state_id = cutoffs.id 
where users.creationdate > cutoff 
group by state 

Thanks, araqnid.