2012-01-05 50 views
67

How to do this? Get the record with the highest/lowest value per group

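The former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?" because I was looking for a solution using ranks, but now I see that the solution posted by Bill is much better.

Original question:

I'm trying to compose a query that takes the last record from each group, given some defined order: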

SET @Rank=0;

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank
            from `Table`
            order by OrderField
           ) as t
      group by GroupId) as t
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
  ) as s
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField

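The expression @Rank := @Rank + 1 is normally used for ranking, but to me it looks suspicious when it is used in two subqueries yet initialized only once. Will it work this way?

Second, will it work with a subquery that is evaluated multiple times, such as a subquery in a WHERE (or HAVING) clause? Here is another way to write the query above: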

SET @Rank=0;

select `Table`.*, @Rank := @Rank + 1 AS Rank
from `Table`
having Rank = (select max(Rank) AS MaxRank
               from (select GroupId, @Rank := @Rank + 1 AS Rank
                     from `Table` as t0
                     order by OrderField
                    ) as t
               where t.GroupId = `Table`.GroupId
              )
order by OrderField

Thanks in advance!

+1

A more advanced question is here: http://stackoverflow.com/questions/9841093/how-to-writegreatest-n-per-group-type-query-but-with-additional-conditions/9845109#9845109 – TMS 2012-03-25 10:56:05

Answers

129

So you want to get the row with the highest OrderField per group? I'd do it this way:

SELECT t1.* 
FROM `Table` AS t1 
LEFT OUTER JOIN `Table` AS t2 
    ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField 
WHERE t2.GroupId IS NULL 
ORDER BY t1.OrderField; -- not needed! (note by Tomas)

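Edit by Tomas: if several records in the same group share the same OrderField and you need exactly one of them, you may want to extend the join condition so that ties are broken by Id: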

SELECT t1.* 
FROM `Table` AS t1 
LEFT OUTER JOIN `Table` AS t2 
    ON t1.GroupId = t2.GroupId 
     AND (t1.OrderField < t2.OrderField 
     OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id)) 
WHERE t2.GroupId IS NULL 

End of edit by Tomas.

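In other words, return each row t1 for which no other row t2 exists with the same GroupId and a greater OrderField. When t2.* is NULL, the left outer join found no such match, and therefore t1 has the greatest value of OrderField in its group.

No ranks, no subqueries. If you have a compound index on (GroupId, OrderField), this should run fast and optimize access to t2 with "Using index".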
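
As a rough sketch of that index advice: the statements below add the compound index and then check the plan for the join query above. The index name idx_group_order is made up for illustration, and the exact EXPLAIN output will vary with the MySQL version and the data.

```sql
-- Hypothetical index name; the column order (GroupId, OrderField) is what matters.
ALTER TABLE `Table`
  ADD INDEX idx_group_order (GroupId, OrderField);

-- Inspect the plan: for t2, the Extra column should ideally show "Using index",
-- meaning the join is resolved from the index alone.
EXPLAIN
SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL;
```

Both join columns are in the index, so the lookup into t2 does not need to read the table rows themselves.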


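Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method against the Stack Overflow data dump. The difference is remarkable: in my test, the join method ran 278 times faster.

It's important that you have the right index to get the best results!

Regarding your method using the @Rank variable, it won't work as you've written it, because the value of @Rank is not reset to zero after the query has processed the first table. I'll show you an example.

I inserted some dummy data, with an extra field that is NULL except on the row we know to be the greatest in each group: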

select * from `Table`; 

+---------+------------+------+
| GroupId | OrderField | foo  |
+---------+------------+------+
|      10 |         10 | NULL |
|      10 |         20 | NULL |
|      10 |         30 | foo  |
|      20 |         40 | NULL |
|      20 |         50 | NULL |
|      20 |         60 | foo  |
+---------+------------+------+
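
To reproduce this test data, a setup along the following lines should work. The column types are assumptions, since the answer shows only the data. The single-column index on GroupId is likewise inferred from the key named GroupId in the EXPLAIN output later in this answer.

```sql
-- Assumed schema: the answer does not show the original CREATE TABLE.
CREATE TABLE `Table` (
  GroupId    INT,
  OrderField INT,
  foo        VARCHAR(10),
  KEY (GroupId)   -- unnamed index; MySQL names it after its first column
);

-- The six rows shown above; foo marks the greatest row in each group.
INSERT INTO `Table` (GroupId, OrderField, foo) VALUES
  (10, 10, NULL),
  (10, 20, NULL),
  (10, 30, 'foo'),
  (20, 40, NULL),
  (20, 50, NULL),
  (20, 60, 'foo');

-- The ranking queries below assume @Rank starts at zero.
SET @Rank := 0;
```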

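We can show that the rank increases to three for the first group and six for the second group, and that the inner query returns these values correctly: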

select GroupId, max(Rank) AS MaxRank 
from (
    select GroupId, @Rank := @Rank + 1 AS Rank 
    from `Table` 
    order by OrderField) as t 
group by GroupId 

+---------+---------+
| GroupId | MaxRank |
+---------+---------+
|      10 |       3 |
|      20 |       6 |
+---------+---------+

Now run the query with no join condition, to force a Cartesian product of all rows, and fetch all columns as well:

select t.*, s.*
from (select GroupId, max(Rank) AS MaxRank 
     from (select GroupId, @Rank := @Rank + 1 AS Rank 
      from `Table` 
      order by OrderField 
      ) as t 
     group by GroupId) as t 
    join (
     select *, @Rank := @Rank + 1 AS Rank 
     from `Table` 
     order by OrderField 
    ) as s 
    -- on t.GroupId = s.GroupId and t.MaxRank = s.Rank 
order by OrderField; 

+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo  | Rank |
+---------+---------+---------+------------+------+------+
|      10 |       3 |      10 |         10 | NULL |    7 |
|      20 |       6 |      10 |         10 | NULL |    7 |
|      10 |       3 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         30 | foo  |    9 |
|      10 |       3 |      10 |         30 | foo  |    9 |
|      10 |       3 |      20 |         40 | NULL |   10 |
|      20 |       6 |      20 |         40 | NULL |   10 |
|      10 |       3 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         60 | foo  |   12 |
|      10 |       3 |      20 |         60 | foo  |   12 |
+---------+---------+---------+------------+------+------+

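We can see from the above that the maximum rank per group is correct, but @Rank then keeps increasing as the second derived table is processed, to 7 and higher. So the ranks from the second derived table will never overlap with the ranks from the first derived table.

You'd have to add another derived table to force @Rank to reset to zero between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates the tables, or else use STRAIGHT_JOIN to prevent that):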

select s.* 
from (select GroupId, max(Rank) AS MaxRank 
     from (select GroupId, @Rank := @Rank + 1 AS Rank 
      from `Table` 
      order by OrderField 
      ) as t 
     group by GroupId) as t 
    join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE 
    join (
     select *, @Rank := @Rank + 1 AS Rank 
     from `Table` 
     order by OrderField 
    ) as s 
    on t.GroupId = s.GroupId and t.MaxRank = s.Rank 
order by OrderField; 

+---------+------------+------+------+
| GroupId | OrderField | foo  | Rank |
+---------+------------+------+------+
|      10 |         30 | foo  |    3 |
|      20 |         60 | foo  |    6 |
+---------+------------+------+------+
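
The STRAIGHT_JOIN variant mentioned above might look like the sketch below. It is the same query with the STRAIGHT_JOIN modifier added, which asks MySQL to join the tables in the order they are listed in the FROM clause. Whether that also fixes the order in which the derived tables are materialized may depend on the server version, so treat this as something to verify rather than a guarantee. The inner alias t0 and the backticks around Rank are additions for this sketch (RANK is a reserved word in newer MySQL versions).

```sql
-- @Rank must start at zero, as in the original question.
SET @Rank := 0;

SELECT STRAIGHT_JOIN s.*
FROM (SELECT GroupId, MAX(`Rank`) AS MaxRank
      FROM (SELECT GroupId, @Rank := @Rank + 1 AS `Rank`
            FROM `Table`
            ORDER BY OrderField) AS t0
      GROUP BY GroupId) AS t
JOIN (SELECT @Rank := 0) AS r          -- reset @Rank between the two rankings
JOIN (SELECT *, @Rank := @Rank + 1 AS `Rank`
      FROM `Table`
      ORDER BY OrderField) AS s
  ON t.GroupId = s.GroupId AND t.MaxRank = s.`Rank`
ORDER BY s.OrderField;
```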

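But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining the temporary tables either. Here is example output from EXPLAIN: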

+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived4> | system | NULL          | NULL | NULL    | NULL |    1 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL | NULL    | NULL |    2 |                                 |
|  1 | PRIMARY     | <derived5> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using where; Using join buffer  |
|  5 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
|  4 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | <derived3> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
|  3 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+

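By contrast, my solution using the left outer join optimizes much better. It uses no temporary table and even reports "Using index", which means it can resolve the join using only the index, without touching the data.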

+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref             | rows | Extra                    |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL    | NULL    | NULL            |    6 | Using filesort           |
|  1 | SIMPLE      | t2    | ref  | GroupId       | GroupId | 5       | test.t1.GroupId |    1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+

You'll probably read people claiming on their blogs that "joins make SQL slow", but that's nonsense. Poor optimization makes SQL slow.

+0

This may prove quite useful (for the OP, too), but sadly it answers neither of the two questions. – 2012-01-05 21:20:42

+0

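Thanks Bill, that's a good idea for avoiding ranks, but... won't the join be slow? The join (without the WHERE clause restriction) would be much larger than in my queries. Anyway, thanks for the idea! But I would also be interested in the original question, i.e. whether the ranks would work this way. – TMS 2012-01-05 23:53:21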

+0

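Thanks for the excellent answer, Bill. However, what if I used @Rank1 and @Rank2, one for each subquery? Would that fix the problem? Would that be faster than your solution? – TMS 2012-01-06 06:37:33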