2017-07-30

How to optimize a PostgreSQL leaderboard window-function query

In our API we have a basic ranking/leaderboard feature: each client user has a list of "actions" they can perform, each action earns a score, and every action is recorded in an actions table. Each user can then request the leaderboard for the current month (the leaderboard resets every month). Nothing fancy.

We have two tables: a users table and an actions table (I have removed the irrelevant columns):

> \d client_users 
              Table "public.client_users" 
   Column   |            Type             |                         Modifiers 
------------+-----------------------------+----------------------------------------------------------- 
 id         | integer                     | not null default nextval('client_users_id_seq'::regclass) 
 app_id     | integer                     | 
 user_id    | character varying           | not null 
 created_at | timestamp without time zone | 
 updated_at | timestamp without time zone | 
Indexes: 
    "client_users_pkey" PRIMARY KEY, btree (id) 
    "index_client_users_on_app_id" btree (app_id) 
    "index_client_users_on_user_id" btree (user_id) 
Foreign-key constraints: 
    "client_users_app_id_fk" FOREIGN KEY (app_id) REFERENCES apps(id) 
Referenced by: 
    TABLE "leaderboard_actions" CONSTRAINT "leaderboard_actions_client_user_id_fk" FOREIGN KEY (client_user_id) REFERENCES client_users(id) 

> \d leaderboard_actions 
             Table "public.leaderboard_actions" 
     Column     |            Type             |                            Modifiers 
----------------+-----------------------------+------------------------------------------------------------------ 
 id             | integer                     | not null default nextval('leaderboard_actions_id_seq'::regclass) 
 client_user_id | integer                     | 
 score          | integer                     | not null default 0 
 created_at     | timestamp without time zone | 
 updated_at     | timestamp without time zone | 
Indexes: 
    "leaderboard_actions_pkey" PRIMARY KEY, btree (id) 
    "index_leaderboard_actions_on_client_user_id" btree (client_user_id) 
    "index_leaderboard_actions_on_created_at" btree (created_at) 
Foreign-key constraints: 
    "leaderboard_actions_client_user_id_fk" FOREIGN KEY (client_user_id) REFERENCES client_users(id) 

The query I am trying to optimize is the following:

SELECT 
    cu.user_id, 
    SUM(la.score) AS total_score, 
    rank() OVER (ORDER BY SUM(la.score) DESC) AS ranking 
FROM client_users cu 
JOIN leaderboard_actions la ON cu.id = la.client_user_id 
WHERE cu.app_id = 8 
AND la.created_at BETWEEN '2017-07-01 00:00:00.000000' AND '2017-07-31 23:59:59.999999' 
GROUP BY cu.id 
ORDER BY total_score DESC 
LIMIT 20; 

Note: client_users.user_id is a VARCHAR "human ID"; the tables are joined via the foreign key on client_users.id (the naming isn't great, I know :D).

Basically, I am asking PostgreSQL for the top 20 users of the current month, ranked by the total score of their individual actions.
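As a side note on the date filter: a half-open range is a common PostgreSQL idiom that avoids spelling out the last representable microsecond of the month and still matches exactly the same rows. A sketch of the same query with that change (everything else unchanged):

```sql
-- Half-open range on created_at: includes the whole of July without
-- relying on '23:59:59.999999' being the last timestamp of the month.
SELECT
    cu.user_id,
    SUM(la.score) AS total_score,
    rank() OVER (ORDER BY SUM(la.score) DESC) AS ranking
FROM client_users cu
JOIN leaderboard_actions la ON cu.id = la.client_user_id
WHERE cu.app_id = 8
  AND la.created_at >= '2017-07-01 00:00:00'
  AND la.created_at <  '2017-08-01 00:00:00'
GROUP BY cu.id
ORDER BY total_score DESC
LIMIT 20;
```

The planner can use the same range scan on the created_at index for both forms, so this is about correctness at the boundary rather than speed.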

As you can see from the query plan, it is not that fast:

Limit (cost=8641.96..8642.05 rows=20 width=52) (actual time=135.544..135.560 rows=20 loops=1) 
Output: cu.user_id, (sum(la.score)), (rank() OVER (?)), cu.id 
-> WindowAgg (cost=8641.96..8841.42 rows=44326 width=52) (actual time=135.543..135.559 rows=20 loops=1) 
     Output: cu.user_id, (sum(la.score)), rank() OVER (?), cu.id 
     -> Sort (cost=8641.96..8664.12 rows=44326 width=44) (actual time=135.538..135.539 rows=20 loops=1) 
      Output: (sum(la.score)), cu.id, cu.user_id 
      Sort Key: (sum(la.score)) DESC 
      Sort Method: quicksort Memory: 1451kB 
      -> HashAggregate (cost=7824.77..7957.75 rows=44326 width=44) (actual time=130.938..133.124 rows=10411 loops=1) 
        Output: sum(la.score), cu.id, cu.user_id 
        Group Key: cu.id 
        -> Hash Join (cost=5858.66..7780.44 rows=44326 width=40) (actual time=50.849..111.346 rows=79382 loops=1) 
         Output: cu.id, cu.user_id, la.score 
         Hash Cond: (la.client_user_id = cu.id) 
         -> Index Scan using index_leaderboard_actions_on_created_at on public.leaderboard_actions la (cost=0.09..1736.77 rows=69494 width=8) (actual time=0.020..33.773 rows=79382 loops=1) 
           Output: la.id, la.client_user_id, la.rule_id, la.score, la.created_at, la.updated_at, la.success 
           Index Cond: ((la.created_at >= '2017-07-01 00:00:00'::timestamp without time zone) AND (la.created_at <= '2017-07-31 23:59:59.999999'::timestamp without time zone)) 
         -> Hash (cost=5572.11..5572.11 rows=81846 width=36) (actual time=50.330..50.330 rows=81859 loops=1) 
           Output: cu.user_id, cu.id 
           Buckets: 131072 Batches: 1 Memory Usage: 6583kB 
           -> Seq Scan on public.client_users cu (cost=0.00..5572.11 rows=81846 width=36) (actual time=0.014..34.539 rows=81859 loops=1) 
            Output: cu.user_id, cu.id 
            Filter: (cu.app_id = 8) 
            Rows Removed by Filter: 46610 
Planning time: 1.276 ms 
Execution time: 136.176 ms 
(26 rows) 

To give you an idea of the sizes involved:

  • client_users has about 128,471 rows, of which 81,860 match the targeted query (app_id = 8)
  • leaderboard_actions has 1,609,992 rows, 79,435 of them in the current month

Any ideas?

Thanks!


I disagree: given how much information you are asking for, the plan is *fast*. – joanolo

Answer


The plan you are getting is actually reasonably fast.

A couple of additional indexes may help your plan (a bit):

CREATE INDEX idx_client_users_app_id_user 
    ON client_users(app_id, id, user_id) ; 

CREATE INDEX idx_leaderboard_actions_3 
    ON leaderboard_actions(created_at, client_user_id, score) ; 

After creating both indexes, execute:

VACUUM ANALYZE client_users; 
VACUUM ANALYZE leaderboard_actions; 

These indexes will (most likely) allow the query to read only them, instead of the tables client_users and leaderboard_actions: all the information it needs is already there. The plan should then show some Index Only Scan nodes.
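To confirm the planner actually switched, the original query can be re-run under EXPLAIN after the indexes are in place (a sketch; note that ANALYZE executes the query for real to get actual timings):

```sql
-- Re-check the plan after creating the indexes and running VACUUM ANALYZE.
-- Ideally both scans become "Index Only Scan" with low "Heap Fetches".
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
    cu.user_id,
    SUM(la.score) AS total_score,
    rank() OVER (ORDER BY SUM(la.score) DESC) AS ranking
FROM client_users cu
JOIN leaderboard_actions la ON cu.id = la.client_user_id
WHERE cu.app_id = 8
AND la.created_at BETWEEN '2017-07-01 00:00:00.000000' AND '2017-07-31 23:59:59.999999'
GROUP BY cu.id
ORDER BY total_score DESC
LIMIT 20;
```

The BUFFERS option additionally shows how many pages were read, which makes the before/after comparison concrete.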

You can find a simulation of your scenario at dbfiddle here, with a 30% improvement in execution time. You may get a similar improvement in your real scenario.


Thanks a lot for your input. It seems to work perfectly. I'll keep an eye on write queries to see whether they slow down, but the overhead shouldn't be too big. –


Maintaining the indexes adds some overhead to every `INSERT` and `UPDATE` (especially the ones that modify any indexed column). Depending on whether your scenario is *read-heavy* or *write-heavy*, these indexes will make more or less sense: in the first case you'll notice an overall improvement, while in the second the overhead may not pay off. – joanolo
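If you later want to check whether the new indexes earn their keep, PostgreSQL's statistics views report how often each index is scanned. A sketch (the index names are the ones proposed above):

```sql
-- Usage counters for the proposed indexes. An index whose idx_scan
-- stays at 0 is never picked by the planner and is a candidate for
-- dropping to reclaim the write overhead.
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE indexrelname IN ('idx_client_users_app_id_user',
                       'idx_leaderboard_actions_3');
```

Note that these counters accumulate since the last statistics reset, so let the workload run for a while before drawing conclusions.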