2013-05-08 70 views
2

Improving subquery performance in Postgres

I have these two tables in my database:

      Student Table                Student Semester Table
| Column     : Type     |       | Column     : Type     |
|------------|----------|       |------------|----------|
| student_id : integer  |       | student_id : integer  |
| satquan    : smallint |       | semester   : integer  |
| actcomp    : smallint |       | enrolled   : boolean  |
| entryyear  : smallint |       | major      : text     |
|-----------------------|       | college    : text     |
                                |-----------------------|

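Here student_id is a unique key in the student table and a foreign key in the student semester table. The semester integer is simply 1 for the first semester, 2 for the second, and so on.

I run queries where I want to select students by their entryyear (and sometimes by their SAT and/or ACT scores), then fetch all of the associated data for those students from the student semester table.

Currently, my queries look something like this: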

SELECT * FROM student_semester 
WHERE student_id IN(
    SELECT student_id FROM student_semester 
    WHERE student_id IN(
     SELECT student_id FROM student WHERE entryyear = 2006 
    ) AND college = 'AS' AND ... 
) 
ORDER BY student_id, semester; 
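
To make the intent of the nested IN query concrete, here is a minimal sketch that runs it against a tiny data set. It assumes an in-memory SQLite database as a stand-in for Postgres, and the sample rows are hypothetical; only the columns the query touches are modelled, and the elided extra conditions are left out.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; this is an assumption made for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, entryyear INTEGER);
    CREATE TABLE student_semester (
        student_id INTEGER REFERENCES student(student_id),
        semester   INTEGER,
        college    TEXT
    );
    -- Hypothetical sample data: student 1 changes college, student 2 is never in 'AS',
    -- student 3 is in 'AS' but entered in a different year.
    INSERT INTO student VALUES (1, 2006), (2, 2006), (3, 2005);
    INSERT INTO student_semester VALUES
        (1, 1, 'AS'), (1, 2, 'EN'),
        (2, 1, 'EN'),
        (3, 1, 'AS');
""")

# Same shape as the query above, without the elided extra conditions.
rows = conn.execute("""
    SELECT student_id, semester, college FROM student_semester
    WHERE student_id IN (
        SELECT student_id FROM student_semester
        WHERE student_id IN (
            SELECT student_id FROM student WHERE entryyear = 2006
        ) AND college = 'AS'
    )
    ORDER BY student_id, semester
""").fetchall()

print(rows)
```

Every semester row of student 1 comes back, including the one outside 'AS', because the filter picks students rather than individual semester rows.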

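However, this results in relatively long-running queries (400 ms) when I am selecting about 1k students. According to the execution plan, most of the time is spent doing a hash join. To improve this, I added the satquan, actcomp, and entryyear columns to the student_semester table. This reduces the query time by about 90%, but it results in a lot of redundant data. Is there a better way to do this?

These are the indexes that I currently have (along with the implicit indexes on student_id):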

CREATE INDEX act_sat_entryyear ON student USING btree (entryyear, actcomp, sattotal) 
CREATE INDEX student_id_major_college ON student_semester USING btree (student_id, major, college) 

Query plan

QUERY PLAN 
Hash Join (cost=17311.74..35895.38 rows=81896 width=65) (actual time=121.097..326.934 rows=25680 loops=1) 
    Hash Cond: (public.student_semester.student_id = public.student_semester.student_id) 
    -> Seq Scan on student_semester (cost=0.00..14307.20 rows=698820 width=65) (actual time=0.015..154.582 rows=698820 loops=1) 
    -> Hash (cost=17284.89..17284.89 rows=2148 width=8) (actual time=121.062..121.062 rows=1284 loops=1) 
     Buckets: 1024 Batches: 1 Memory Usage: 51kB 
     -> HashAggregate (cost=17263.41..17284.89 rows=2148 width=8) (actual time=120.708..120.871 rows=1284 loops=1) 
       -> Hash Semi Join (cost=1026.68..17254.10 rows=3724 width=8) (actual time=4.828..119.619 rows=6184 loops=1) 
        Hash Cond: (public.student_semester.student_id = student.student_id) 
        -> Seq Scan on student_semester (cost=0.00..16054.25 rows=42908 width=4) (actual time=0.013..109.873 rows=42331 loops=1) 
          Filter: ((college)::text = 'AS'::text) 
        -> Hash (cost=988.73..988.73 rows=3036 width=4) (actual time=4.801..4.801 rows=3026 loops=1) 
          Buckets: 1024 Batches: 1 Memory Usage: 107kB 
          -> Bitmap Heap Scan on student (cost=71.78..988.73 rows=3036 width=4) (actual time=0.406..3.223 rows=3026 loops=1) 
           Recheck Cond: (entryyear = 2006) 
           -> Bitmap Index Scan on student_act_sat_entryyear_index (cost=0.00..71.03 rows=3036 width=0) (actual time=0.377..0.377 rows=3026 loops=1) 
             Index Cond: (entryyear = 2006) 
Total runtime: 327.708 ms 

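I was mistaken about there not being a seq scan in the query. I think the seq scan is being done because of the number of rows that match the college condition; when I change the condition to a college with fewer students, an index is used. Source: https://stackoverflow.com/a/5203827/880928

Query with the entryyear column included in the student semester table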

SELECT * FROM student_semester 
WHERE student_id IN(
    SELECT student_id FROM student_semester 
    WHERE entryyear = 2006 AND collgs = 'AS' 
) ORDER BY student_id, semester; 

Query plan

Sort (cost=18597.13..18800.49 rows=81343 width=65) (actual time=72.946..74.003 rows=25680 loops=1) 
    Sort Key: public.student_semester.student_id, public.student_semester.semester 
    Sort Method: quicksort Memory: 3546kB 
    -> Nested Loop (cost=9843.87..11962.91 rows=81343 width=65) (actual time=24.617..40.751 rows=25680 loops=1) 
     -> HashAggregate (cost=9843.87..9845.73 rows=186 width=4) (actual time=24.590..24.836 rows=1284 loops=1) 
       -> Bitmap Heap Scan on student_semester (cost=1612.75..9834.63 rows=3696 width=4) (actual time=10.401..23.637 rows=6184 loops=1) 
        Recheck Cond: (entryyear = 2006) 
        Filter: ((collgs)::text = 'AS'::text) 
        -> Bitmap Index Scan on entryyear_act_sat_semester_enrolled_cumdeg_index (cost=0.00..1611.82 rows=60192 width=0) (actual time=10.259..10.259 rows=60520 loops=1) 
          Index Cond: (entryyear = 2006) 
     -> Index Scan using student_id_index on student_semester (cost=0.00..11.13 rows=20 width=65) (actual time=0.003..0.010 rows=20 loops=1284) 
       Index Cond: (student_id = public.student_semester.student_id) 
Total runtime: 74.938 ms 

Please post the execution plan from `explain analyze` and any indexes defined on the tables. More on posting this kind of question here: https://wiki.postgresql.org/wiki/Slow_Query_Questions – 2013-05-08 16:48:47

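When asking for performance optimization, you also have to provide your Postgres version. That should go without saying. Read the [tag info for postgresql-performance](http://stackoverflow.com/tags/postgresql-performance/info) – 2013-05-08 16:50:55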

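@ErwinBrandstetter I did not post the Postgres version because I thought this was more of a general database schema/query strategy question, but I will add the version as well as the query plan. – cmorse 2013-05-08 17:05:01

Answers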

1

A cleaner version of your query:

select ss.* 
from 
    student s 
    inner join 
    student_semester ss using(student_id) 
where 
    s.entryyear = 2006 
    and exists (
     select 1 
     from student_semester 
     where 
      college = 'AS' 
      and student_id = s.student_id 
    ) 
order by ss.student_id, semester 

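If there are indexes covering student.entryyear and student_semester.college as well as student_semester.semester, I would expect this to perform well. On the other hand, if there are only 2 values in student_semester.semester, that *would* be annoying. EXPLAIN ANALYSE will tell the whole story. – 2013-05-08 15:51:57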

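This is not the same query. This will only return rows for the 'AS' college. The original query returns the records of students who were ever in the 'AS' college. – 2013-05-08 16:09:41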

@Gordon I don't understand the _who were ever in the 'AS' college_ part of your comment. – 2013-05-08 16:18:56

0

It appears that you want the students who entered in 2006 and who have ever been in the AS college.

Version one.

SELECT sem.* 
FROM student s JOIN student_semester sem USING (student_id) 
WHERE s.entryyear = 2006 
    AND student_id IN (SELECT student_id 
         FROM student_semester s2 WHERE s2.college = 'AS') 
    -- AND other criteria 
ORDER BY sem.student_id, semester; 

Version two.

SELECT sem.* 
FROM student s JOIN student_semester sem USING (student_id) 
WHERE s.entryyear = 2006 
    AND EXISTS 
     (SELECT 1 FROM student_semester s2 
      WHERE s2.student_id = s.student_id AND s2.college = 'AS') 
      -- An index like this may help the EXISTS probe: 
      -- CREATE INDEX foo ON student_semester (student_id, college); 
    -- AND other criteria 
ORDER BY sem.student_id, semester; 

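I would expect both to be fast, but whether one of them performs better than the other (or produces exactly the same plan) is a PG mystery.

[EDIT] Here is a version with no semi-joins. I would not expect it to work well, since it produces multiple hits, one for each semester a student was in AS.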
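
As a sanity check that the rewrites return the same rows as the original nested IN query, the following sketch runs all three against a small data set. It assumes in-memory SQLite as a stand-in for Postgres, hypothetical sample rows, and a hypothetical index name; it says nothing about which plan Postgres would choose or how fast each form runs.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; an assumption for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, entryyear INTEGER);
    CREATE TABLE student_semester (student_id INTEGER, semester INTEGER, college TEXT);
    -- Index suggested in version two; the index name is hypothetical.
    CREATE INDEX student_semester_id_college_idx ON student_semester (student_id, college);
    -- Hypothetical data: student 1 spends two semesters in 'AS', then moves to 'EN'.
    INSERT INTO student VALUES (1, 2006), (2, 2006), (3, 2005);
    INSERT INTO student_semester VALUES
        (1, 1, 'AS'), (1, 2, 'AS'), (1, 3, 'EN'),
        (2, 1, 'EN'),
        (3, 1, 'AS');
""")

queries = {
    # The question's original nested IN form.
    "nested_in": """
        SELECT student_id, semester, college FROM student_semester
        WHERE student_id IN (
            SELECT student_id FROM student_semester
            WHERE student_id IN (SELECT student_id FROM student WHERE entryyear = 2006)
              AND college = 'AS')
        ORDER BY student_id, semester""",
    # Version one: join plus IN.
    "join_in": """
        SELECT sem.student_id, sem.semester, sem.college
        FROM student s JOIN student_semester sem USING (student_id)
        WHERE s.entryyear = 2006
          AND s.student_id IN (SELECT s2.student_id FROM student_semester s2
                               WHERE s2.college = 'AS')
        ORDER BY sem.student_id, sem.semester""",
    # Version two: join plus EXISTS.
    "join_exists": """
        SELECT sem.student_id, sem.semester, sem.college
        FROM student s JOIN student_semester sem USING (student_id)
        WHERE s.entryyear = 2006
          AND EXISTS (SELECT 1 FROM student_semester s2
                      WHERE s2.student_id = s.student_id AND s2.college = 'AS')
        ORDER BY sem.student_id, sem.semester""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

Because IN and EXISTS act as semi-joins, student 1 is returned once per semester row even though two of those semesters match 'AS'.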

SELECT DISTINCT ON (/* PK of sem */) sem.* 
FROM student s 
    JOIN student_semester sem USING (student_id) 
    JOIN student_semester s2 USING (student_id) 
WHERE s.entryyear = 2006 
    AND s2.college = 'AS' 
ORDER BY sem.student_id, semester; 

Neither of these actually performs any better than the original query. Here are the query plans. Version one: http://pastebin.com/zXafx0ct, version two: http://pastebin.com/vntd96dU – cmorse 2013-05-10 22:54:51

That is disappointing. I have one more possibility, which I added in an edit. By the way, what are the indexes on `student_semester`? – 2013-05-11 01:26:42

1

Another way to do the query is to use window functions.

select t.* -- Has the extra NumMatches column. To eliminate it, list the columns you want 
from (select ss.*, 
      -- per-student count of semester rows in 'AS' for students who entered in 2006 
      sum(case when ss.college = 'AS' and s.entryyear = 2006 then 1 else 0 end) over 
        (partition by ss.student_id) as NumMatches 
     from student_semester ss join 
      student s 
      on ss.student_id = s.student_id 
    ) t 
where NumMatches > 0; 

Window functions are often faster than joining to an aggregation, so I suspect this might perform well.
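
For readers who have not used window functions, here is a minimal sketch of the same idea on a toy data set. It assumes in-memory SQLite (version 3.25 or later, which added window functions) as a stand-in for Postgres, and the sample rows are hypothetical. It only illustrates the result, not the performance.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, entryyear INTEGER);
    CREATE TABLE student_semester (student_id INTEGER, semester INTEGER, college TEXT);
    -- Hypothetical sample data.
    INSERT INTO student VALUES (1, 2006), (2, 2006), (3, 2005);
    INSERT INTO student_semester VALUES
        (1, 1, 'AS'), (1, 2, 'EN'),
        (2, 1, 'EN'),
        (3, 1, 'AS');
""")

# num_matches counts, per student, the semester rows that are in 'AS' for a 2006 entrant.
# The count is attached to every row of that student, so the outer filter keeps all of
# the student's rows as soon as at least one of them matched.
rows = conn.execute("""
    SELECT student_id, semester, college
    FROM (
        SELECT ss.student_id, ss.semester, ss.college,
               SUM(CASE WHEN ss.college = 'AS' AND s.entryyear = 2006 THEN 1 ELSE 0 END)
                   OVER (PARTITION BY ss.student_id) AS num_matches
        FROM student_semester ss
        JOIN student s ON ss.student_id = s.student_id
    ) t
    WHERE num_matches > 0
    ORDER BY student_id, semester
""").fetchall()

print(rows)
```

Note that the inner query computes num_matches for every row of the join before anything is filtered out.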

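This actually runs quite a bit slower than the original query, taking about 1 second to complete. According to the query plan, it scans every row in the tables 3 times (even though it claims to be using indexes). – cmorse 2013-05-10 22:08:31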

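@cmorse Interesting. I'm glad you ran the test. I think the difference between the queries is that this one calculates `NumMatches` for all of the data rather than for a subset. The selectivity of the aggregation outweighs (what I think is) the slightly better performance of window functions. – 2013-05-10 22:53:50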

Thanks for posting this query. I have never done much with window functions, so it was interesting to see it done. – cmorse 2013-05-10 22:58:27