SQL magic - query shouldn't take 15 hours, but it does

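OK, so I have one truly monstrous MySQL table (900k records, 180 MB total), and I want to extract, from each subgroup, the records with the highest date_updated and then calculate a weighted average for each group. The calculation runs for about 15 hours, and I have a strong feeling I'm doing it wrong.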

First, the monstrous table layout:

category
element_id
date_updated
value
weight
source_prefix
source_name

The only key here is on element_id (BTREE, ~8k unique elements).

And the calculation process:

Make a hash for each group and subgroup.

CREATE TEMPORARY TABLE `temp1` (INDEX (`subcat_hash`))
       SELECT `category`, 
       `element_id`, 
       `source_prefix`, 
       `source_name`, 
       `date_updated`, 
       `value`, 
       `weight`, 
       MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`, 
       MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash` 
       FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28'

I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, I presume.

Find the maximum date for each subgroup

CREATE TEMPORARY TABLE `temp2` (INDEX (`subcat_hash`)) 

       SELECT MAX(`date_updated`) AS `maxdate` , `subcat_hash` 
       FROM `temp1` 
       GROUP BY `subcat_hash`;

Join temp1 with temp2 to find the weighted average for each category

CREATE TEMPORARY TABLE `valuebycats` (INDEX (`category`)) 
      SELECT `temp1`.`element_id`, 
        `temp1`.`category`, 
        `temp1`.`source_prefix`, 
        `temp1`.`source_name`, 
        `temp1`.`date_updated`, 
        AVG(`temp1`.`value`) AS `avg_value`, 
      SUM(`temp1`.`value` * `temp1`.`weight`)/SUM(`weight`) AS `rating` 

      FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash` 
      WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash` 
      AND `temp1`.`date_updated` = `temp2`.`maxdate` 

      GROUP BY `temp1`.`cat_hash`;

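(Now that I've looked through it and written it all down, it seems to me that I should use an INNER JOIN in that last query, to avoid a 900k * 900k temporary table.)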
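The LEFT JOIN versus INNER JOIN question can be checked on toy data. Because the WHERE clause compares against columns of temp2, any row the LEFT JOIN pads with NULLs is thrown away again, so both forms should return the same rows. A minimal sketch, assuming Python's sqlite3 as a stand-in for MySQL and a few invented rows; only the table and column names come from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temp1 (subcat_hash TEXT, date_updated TEXT, value REAL);
    CREATE TABLE temp2 (subcat_hash TEXT, maxdate TEXT);
    INSERT INTO temp1 VALUES
        ('a', '2009-04-01', 1.0),
        ('a', '2009-04-20', 2.0),
        ('b', '2009-04-10', 3.0),
        ('c', '2009-04-15', 4.0);   -- subgroup c has no row in temp2
    INSERT INTO temp2 VALUES ('a', '2009-04-20'), ('b', '2009-04-10');
""")

# LEFT JOIN, then a WHERE on a temp2 column: NULL-padded rows never match.
left_join_rows = conn.execute("""
    SELECT temp1.subcat_hash, temp1.value
    FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
    WHERE temp1.date_updated = temp2.maxdate
    ORDER BY temp1.subcat_hash
""").fetchall()

# The same query written as an INNER JOIN.
inner_join_rows = conn.execute("""
    SELECT temp1.subcat_hash, temp1.value
    FROM temp1 INNER JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
    WHERE temp1.date_updated = temp2.maxdate
    ORDER BY temp1.subcat_hash
""").fetchall()

print(left_join_rows)
print(inner_join_rows)
```

This only shows that the results match. Whether the rewrite changes the execution plan, and therefore the running time, depends on the optimizer and has to be measured on the real table.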

Still, is there a normal way to do this?

UPD: some picture for reference:

(removed dead ImageShack link)

UPD: EXPLAIN for the proposed solution:

+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key        | key_len | ref                                                                                  | rows   | filtered | Extra                                        |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
|  1 | SIMPLE      | cur   | ALL  | NULL          | NULL       | NULL    | NULL                                                                                 | 893085 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | next  | ref  | prefix        | prefix     | 1074    | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id |      1 |   100.00 | Using where                                  |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+

2009-05-22 Kuroki Kaze

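Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you have to write your own hash-based join; this certainly doesn't look like one of those cases, with a 900k-row table and some aggregates.

Based on your comment, this query might be what you are looking for: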

SELECT cur.source_prefix, 
     cur.source_name, 
     cur.category, 
     cur.element_id, 
     MAX(cur.date_updated) AS DateUpdated, 
     AVG(cur.value) AS AvgValue, 
     SUM(cur.value * cur.weight)/SUM(cur.weight) AS Rating 
FROM eev0 cur 
LEFT JOIN eev0 next 
    ON next.date_updated < '2009-05-01' 
    AND next.source_prefix = cur.source_prefix 
    AND next.source_name = cur.source_name 
    AND next.element_id = cur.element_id 
    AND next.date_updated > cur.date_updated 
WHERE cur.date_updated < '2009-05-01' 
AND next.category IS NULL 
GROUP BY cur.source_prefix, cur.source_name, 
    cur.category, cur.element_id

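The GROUP BY performs the calculations for each source + category + element.

The JOIN is there to filter out old entries. It looks for later entries, and the WHERE clause then removes every row for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).

There are many ways to filter out old entries, but this one tends to perform reasonably well.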
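As a sanity check of this self-join filter, here is a minimal sketch that runs it on a handful of invented rows. It uses Python's sqlite3 in place of MySQL; the table name `readings`, the index name `idx_latest`, the alias `nxt` and all the data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (
        category TEXT, element_id INTEGER, date_updated TEXT,
        value REAL, weight REAL, source_prefix TEXT, source_name TEXT
    );
    -- The composite index suggested in the answer (name is hypothetical).
    CREATE INDEX idx_latest ON readings
        (source_prefix, source_name, element_id, date_updated);
    INSERT INTO readings VALUES
        ('c1', 1, '2009-04-01', 10.0, 1.0, 'p', 's1'),  -- superseded
        ('c1', 1, '2009-04-20', 20.0, 1.0, 'p', 's1'),  -- latest for s1
        ('c1', 1, '2009-04-10', 40.0, 3.0, 'p', 's2'),  -- latest for s2
        ('c1', 1, '2009-05-05', 99.0, 1.0, 'p', 's2');  -- past the cutoff
""")

latest = conn.execute("""
    SELECT cur.source_name,
           MAX(cur.date_updated) AS date_updated,
           SUM(cur.value * cur.weight) / SUM(cur.weight) AS rating
    FROM readings cur
    LEFT JOIN readings nxt
        ON nxt.date_updated < '2009-05-01'
        AND nxt.source_prefix = cur.source_prefix
        AND nxt.source_name = cur.source_name
        AND nxt.element_id = cur.element_id
        AND nxt.date_updated > cur.date_updated
    WHERE cur.date_updated < '2009-05-01'
      AND nxt.category IS NULL          -- keep rows with no later entry
    GROUP BY cur.source_prefix, cur.source_name,
             cur.category, cur.element_id
    ORDER BY cur.source_name
""").fetchall()

for row in latest:
    print(row)
```

Only the newest row before the cutoff survives for each source, so each group here contains exactly one row. The relative cost of this pattern on 900k rows is not something a toy example can show.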

2009-05-22 10:26:30 Andomar

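Well, 900K rows isn't a massive table. It's reasonably big, but your queries really shouldn't be taking that long.

First things first: which of the three statements above takes most of the time?

The first problem I see is with your first query. Your WHERE clause doesn't include an indexed column, which means it has to do a full table scan.

Create an index on the date_updated column, then run the query again and see what that does for you.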
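Whether such an index is actually picked up can be read from the query plan rather than guessed. The sketch below does that with SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3, standing in for MySQL's `EXPLAIN`; the index name `idx_date_updated` is made up and the table is left empty, so this shows the mechanics only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bigbigtable (
        category TEXT, element_id INTEGER, date_updated TEXT,
        value REAL, weight REAL, source_prefix TEXT, source_name TEXT
    )
""")

query = "SELECT * FROM bigbigtable WHERE date_updated <= '2009-04-28'"

def plan(sql):
    # Each plan row has four columns; the last one is the readable detail.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(query)
conn.execute("CREATE INDEX idx_date_updated ON bigbigtable (date_updated)")
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

Comparing the two plans shows whether the planner switched from scanning the table to using the new index. On the real data an optimizer may still prefer a full scan if the date filter matches most of the table, so the plan is worth checking there as well.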

If you don't need the hashes and are only using them for the dark magic, remove them completely.
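One way to picture a hash-free version is to group on the underlying columns directly, backed by a composite index. The sketch below does that and also shows a reason to be wary of the hash input: concatenating columns without a separator can map two different subgroups to the same string. It assumes Python's sqlite3 instead of MySQL, a text `category`, a hypothetical index name `idx_subcat` and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigbigtable (
        category TEXT, element_id INTEGER, date_updated TEXT,
        source_prefix TEXT, source_name TEXT
    );
    -- Hypothetical composite index covering the grouping columns.
    CREATE INDEX idx_subcat ON bigbigtable
        (category, element_id, source_prefix, source_name, date_updated);
    INSERT INTO bigbigtable VALUES
        ('1',  23, '2009-04-01', 'p', 's'),
        ('1',  23, '2009-04-20', 'p', 's'),
        ('12',  3, '2009-04-10', 'p', 's');
""")

# Group on the real columns: two distinct subgroups, each with its max date.
by_columns = conn.execute("""
    SELECT category, element_id, MAX(date_updated)
    FROM bigbigtable
    GROUP BY category, element_id, source_prefix, source_name
    ORDER BY category, element_id
""").fetchall()

# Group on the separator-less concatenation (the string the MD5 is taken of):
# '1' || 23 and '12' || 3 both start with '123', so the subgroups merge.
by_concat = conn.execute("""
    SELECT category || element_id || source_prefix || source_name AS k,
           MAX(date_updated)
    FROM bigbigtable
    GROUP BY k
""").fetchall()

print(by_columns)
print(by_concat)
```

Grouping on the columns keeps the two subgroups apart, while the concatenated key collapses them into one. Whether the composite index makes this as fast as the hashed temporary tables on the real data would need to be measured.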

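Edit: someone with more SQL-fu than me could probably reduce your whole set of logic to a single SQL statement, without temporary tables.

Edit: my SQL is a little rusty, but aren't you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be: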

SELECT temp1.element_id, 
    temp1.category, 
    temp1.source_prefix, 
    temp1.source_name, 
    temp1.date_updated, 
    AVG(temp1.value) AS avg_value, 
    SUM(temp1.value * temp1.weight)/SUM(weight) AS rating 
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash 
WHERE temp1.date_updated = temp2.maxdate 
GROUP BY temp1.cat_hash;

or

SELECT temp1.element_id, 
    temp1.category, 
    temp1.source_prefix, 
    temp1.source_name, 
    temp1.date_updated, 
    AVG(temp1.value) AS avg_value, 
    SUM(temp1.value * temp1.weight)/SUM(weight) AS rating 
FROM temp1, temp2 
WHERE temp2.subcat_hash = temp1.subcat_hash 
AND temp1.date_updated = temp2.maxdate 
GROUP BY temp1.cat_hash;

2009-05-22 10:19:27 Glen

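The last one. The first is instant, the second takes about 23 minutes. – 2009-05-22 10:23:08

I can remove the hashes, but then the query will take an infinite amount of time (okay, maybe not, but I don't have that much patience, and neither does the client). I think these hashes could somehow be turned into indexes. – 2009-05-22 10:25:43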
