
I'm having trouble speeding up a query that takes about 11 seconds on only 2 million rows. Here is a link to my sqlfiddle. Below are the query I'm trying to run and my EXPLAIN output. How can I speed up a GROUP BY statement with multiple joins?

Query:

SELECT crawl.pk Pk,domains.domain Domain, 
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, 
crawl.redirect Redirect FROM crawl 
LEFT JOIN dates ON crawl.date_crawled=dates.pk  
LEFT JOIN schemes ON crawl.scheme=schemes.pk 
LEFT JOIN domains ON crawl.domain=domains.pk 
LEFT JOIN remainders ON crawl.remainder=remainders.pk 
WHERE (dates.date < CURDATE() - INTERVAL 30 DAY) 
AND crawl.redirect=0 
GROUP BY crawl.domain 
ORDER BY crawl.date_crawled ASC 
LIMIT 50 

EXPLAIN:

+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+ 
| id | select_type | table  | type | possible_keys   | key     | key_len | ref      | rows | Extra          | 
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+ 
| 1 | SIMPLE  | dates  | ALL | PRIMARY,date   | NULL     | NULL | NULL      |  7 | Using where; Using temporary; Using filesort | 
| 1 | SIMPLE  | crawl  | ref | date_crawled_redirect | date_crawled_redirect | 8  | mytable.dates.pk,const  | 408644 |            | 
| 1 | SIMPLE  | schemes | eq_ref | PRIMARY    | PRIMARY    | 4  | mytable.crawl.scheme  |  1 |            | 
| 1 | SIMPLE  | domains | eq_ref | PRIMARY    | PRIMARY    | 4  | mytable.crawl.domain  |  1 |            | 
| 1 | SIMPLE  | remainders | eq_ref | PRIMARY    | PRIMARY    | 4  | mytable.crawl.remainder |  1 |            | 
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+ 
5 rows in set (2.26 sec) 

EDIT #1: As suggested in the comments, I replaced the LEFT JOINs with JOINs and moved the date filter into the join. Sadly, this did not reduce the query time.

SELECT crawl.pk Pk,domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, crawl.redirect Redirect 
FROM crawl 
JOIN schemes ON crawl.scheme=schemes.pk 
JOIN domains ON crawl.domain=domains.pk 
JOIN remainders ON crawl.remainder=remainders.pk 
JOIN dates ON crawl.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY 
WHERE crawl.redirect=0 
GROUP BY crawl.domain 
ORDER BY crawl.date_crawled ASC 
LIMIT 50 

EDIT #2: My updated EXPLAIN:

+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+ 
| id | select_type | table  | type | possible_keys           | key     | key_len | ref      | rows | Extra              | 
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+ 
| 1 | SIMPLE  | dates  | range | PRIMARY,date,date_pk,dateBtreeIdx,pk      | date_pk    | 3  | NULL      |  4 | Using where; Using index; Using temporary; Using filesort | 
| 1 | SIMPLE  | crawl  | ref | domain_remainder,remainder,scheme,date_crawled_redirect | date_crawled_redirect | 8  | mytable.dates.pk,const  | 408644 |               | 
| 1 | SIMPLE  | schemes | ALL | PRIMARY             | NULL     | NULL | NULL      |  2 | Using where; Using join buffer       | 
| 1 | SIMPLE  | domains | eq_ref | PRIMARY             | PRIMARY    | 4  | mytable.crawl.domain  |  1 |               | 
| 1 | SIMPLE  | remainders | eq_ref | PRIMARY             | PRIMARY    | 4  | mytable.crawl.remainder |  1 |               | 
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+ 

EDIT #3:

+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+ 
| id | select_type  | table  | type   | possible_keys       | key  | key_len | ref      | rows | Extra       | 
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+ 
| 1 | PRIMARY   | schemes | ALL    | PRIMARY         | NULL | NULL | NULL      |  2 | Using temporary; Using filesort | 
| 1 | PRIMARY   | crawl  | ref    | domain_remainder,remainder,scheme,domain | scheme | 4  | mytable.schemes.pk   | 1448223 | Using where      | 
| 1 | PRIMARY   | domains | eq_ref   | PRIMARY         | PRIMARY | 4  | mytable.crawl.domain  |  1 |         | 
| 1 | PRIMARY   | remainders | eq_ref   | PRIMARY         | PRIMARY | 4  | mytable.crawl.remainder |  1 |         | 
| 2 | DEPENDENT SUBQUERY | dates  | unique_subquery | PRIMARY,date,date_pk,dateBtreeIdx,pk  | PRIMARY | 4  | func      |  1 | Using where      | 
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+ 
5 rows in set (0.04 sec) 

EDIT #4:

+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+ 
| id | select_type | table  | type | possible_keys      | key      | key_len | ref      | rows | Extra              | 
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+ 
| 1 | SIMPLE  | dates  | range | PRIMARY,date,date_pk,dateBtreeIdx,pk | date_pk     | 3  | NULL      |  4 | Using where; Using index; Using temporary; Using filesort | 
| 1 | SIMPLE  | schemes | ALL | PRIMARY        | NULL     | NULL | NULL      |  2 | Using join buffer           | 
| 1 | SIMPLE  | crawl  | ref | scheme_domain_remainder    | scheme_domain_remainder | 4  | mytable.schemes.pk   | 1455517 | Using where            | 
| 1 | SIMPLE  | domains | eq_ref | PRIMARY        | PRIMARY     | 4  | mytable.crawl.domain  |  1 |               | 
| 1 | SIMPLE  | remainders | eq_ref | PRIMARY        | PRIMARY     | 4  | mytable.crawl.remainder |  1 |               | 
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+ 
5 rows in set (0.04 sec) 

EDIT #5:

SELECT urls.pk PK, domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, urls.redirect Redirect, urls.date_crawled DC FROM 
(SELECT * FROM (
SELECT * FROM crawl as urls ORDER BY date_crawled ASC 
) AS tmp GROUP BY tmp.domain) as urls 
JOIN schemes ON urls.scheme=schemes.pk 
JOIN domains ON urls.domain=domains.pk 
JOIN remainders ON urls.remainder=remainders.pk 
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY 
WHERE urls.redirect=0 
ORDER BY urls.date_crawled ASC 
LIMIT 50 
+1

Just to confirm: do you want all records older than 30 days? Or do you mean fetching all records from within the last 30 days? Your indexes otherwise look fine. – DRapp 2014-11-09 01:15:05

+0

50 records older than 30 days. – 2014-11-09 01:44:08

+1

Do you really need LEFT JOINs for all of these? Also, try moving the WHERE conditions into the joins themselves. – 2014-11-09 02:09:57

Answer

2

You have a nearly optimal query at hand. The only problem is the suboptimal index on the dates table. As you can see in your EXPLAIN output, MySQL cannot use any index on dates, so it is chosen as the first table. This leads to a semi-optimized execution plan in which a huge number of rows from your crawl table must be accessed.

To improve this, you should add a BTREE index on your dates.date column:

ALTER TABLE dates ADD INDEX dateBtreeIdx USING BTREE (date) 

BTREE indexes can be used for range conditions; in your case, the "less than", see here.

Building on this, you can also try adding the join column dates.pk to that index. This may speed up your query further, but it depends on your data.
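A minimal sketch of such a composite index (the name date_pk matches the index that appears in the question's later EXPLAIN output; the exact name is your choice):

```sql
-- Composite BTREE index: the range-filtered column `date` comes first,
-- and the join key `pk` is appended so MySQL can resolve both the WHERE
-- condition and the join from the index alone (a covering index, shown
-- as "Using index" in EXPLAIN).
ALTER TABLE dates ADD INDEX date_pk USING BTREE (date, pk);
```

Note the column order matters: putting the range column first lets the index serve the `date <` predicate, while the trailing `pk` merely avoids a lookup back into the table row.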

EDIT

Now MySQL is able to use the index on dates.date (type=range and rows=4). You don't see a speedup because the optimizer no longer uses the PRIMARY KEY on schemes...

However, the performance problem still lies in crawl. Try a different approach with an IN subquery:

SELECT 
    crawl.pk Pk, domains.domain Domain, 
    CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, 
    crawl.redirect Redirect 
FROM 
    crawl, schemes, domains, remainders 
WHERE 
    crawl.scheme=schemes.pk 
    AND crawl.domain=domains.pk 
    AND crawl.remainder=remainders.pk 

    AND crawl.date_crawled IN (SELECT pk FROM dates WHERE (dates.date < CURDATE() - INTERVAL 30 DAY)) 
    AND crawl.redirect=0 

GROUP BY 
    crawl.domain 
ORDER BY 
    crawl.date_crawled ASC 
LIMIT 50 

EDIT #2

SELECT 
    urls.pk PK, domains.domain Domain, 
    CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, 
    urls.redirect Redirect, 
    urls.date_crawled DC 
FROM 
    (SELECT pk, redirect, date_crawled FROM crawl GROUP BY `domain`) as urls 
JOIN schemes ON urls.scheme=schemes.pk 
JOIN domains ON urls.`domain`=domains.pk 
JOIN remainders ON urls.remainder=remainders.pk 
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY 
WHERE 
    urls.redirect=0 
ORDER BY urls.date_crawled ASC 
LIMIT 50 
+0

Thx for the suggestion, but unfortunately it didn't work – 2014-11-09 16:33:48

+1

What does your 'EXPLAIN' say now? – Benvorth 2014-11-09 16:38:36

+0

See my EDIT #2. date_pk is the suggested new index. I also tried pk_date and dateBtreeIdx separately. None of them sped it up. – 2014-11-09 16:52:38