2014-07-22 122 views
0

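Optimizing a SQL duplicate search

We are working with a table of roughly 13 million rows. Our goal is to find duplicates in this table within a single restaurant (about 300,000 rows). Our criteria for a duplicate are: same last name, same first two letters of the first name, and either the same phone or the same email. Each of these is its own column. Our current strategy is to build two identical temporary tables containing all rows for the restaurant, join them on the criteria above, and return id, firstname, lastname, phone and email from the first table.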

SELECT 
    DISTINCT t1.id, t1.firstname, t1.lastname, t1.phone, t1.email 
FROM 
(
    SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email 
    FROM loyalty_member_opentable_customer lmoc 
    WHERE lmoc.opentable_restaurant_id=2296 
     AND lmoc.lastname NOT LIKE '%Tour%' 
) AS t1 
INNER JOIN 
(
    SELECT lmoc2.id, lmoc2.firstname, lmoc2.lastname, lmoc2.phone, lmoc2.email 
    FROM loyalty_member_opentable_customer lmoc2 
    WHERE lmoc2.opentable_restaurant_id=2296 
     AND lmoc2.lastname NOT LIKE '%Tour%' 
) AS t2 
    ON STRCMP(t1.lastname,t2.lastname)=0 
    AND t1.id!=t2.id 
    AND STRCMP(LEFT(t1.firstname,2),LEFT(t2.firstname,2))=0 
    AND (STRCMP(t1.phone,t2.phone)=0 OR STRCMP(t1.email,t2.email)=0) 
ORDER BY t1.lastname, t1.firstname 

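The problem is that this query takes north of 48 hours to run. Can anyone think of a more efficient way to run it? We need all of the duplicates so that the restaurant can merge them as it sees fit.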
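One rewrite that is often worth trying for the criteria described above is to split the OR condition into two plain equality joins combined with UNION, since an OR across two columns tends to stop the join from using an index. The sketch below is only an illustration under several assumptions: it runs on an in-memory SQLite database through Python so that it is self-contained, the customer table and its rows are made up, and = with SUBSTR stands in for STRCMP and LEFT. SQLite compares text case-sensitively by default, which may differ from the collation of the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (          -- hypothetical stand-in for the real table
    id INTEGER PRIMARY KEY,
    firstname TEXT, lastname TEXT, phone TEXT, email TEXT
);
INSERT INTO customer VALUES
    (1, 'John',     'Smith', '555-0100', 'john@example.com'),
    (2, 'Jonathan', 'Smith', '555-0100', 'jsmith@example.com'),  -- same phone as 1
    (3, 'Joan',     'Smith', '555-0199', 'john@example.com'),    -- same email as 1
    (4, 'Mary',     'Smith', '555-0100', 'mary@example.com'),    -- first name differs
    (5, 'John',     'Jones', '555-0100', 'john@example.com');    -- last name differs
""")

# Two equality joins instead of one join with OR. UNION removes pairs that
# match on both phone and email; a.id < b.id reports each pair only once.
PAIRS_SQL = """
SELECT a.id AS id1, b.id AS id2
FROM customer a
JOIN customer b
  ON a.lastname = b.lastname
 AND SUBSTR(a.firstname, 1, 2) = SUBSTR(b.firstname, 1, 2)
 AND a.phone = b.phone
 AND a.id < b.id
UNION
SELECT a.id, b.id
FROM customer a
JOIN customer b
  ON a.lastname = b.lastname
 AND SUBSTR(a.firstname, 1, 2) = SUBSTR(b.firstname, 1, 2)
 AND a.email = b.email
 AND a.id < b.id
ORDER BY id1, id2
"""

pairs = conn.execute(PAIRS_SQL).fetchall()
print(pairs)
```

Each result row carries the ids of both members of a pair, which is what a later merge step would need. Whether this is actually faster on the real 13-million-row table depends on the available indexes and has not been measured here.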

+2

Sounds like a good strategy. Have fun. – Strawberry

+1

This question appears to be off-topic because there is no question. – Strawberry

+0

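It would be useful if you posted the table structure and the SQL query. Some information about current performance would also help gauge what can be improved. Try rephrasing this as a question. –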

Answers

1

Why not simply do this:

SELECT lmoc.lastname, lmoc.firstname, lmoc.phone, lmoc.email 
FROM loyalty_member_opentable_customer lmoc 
WHERE lmoc.opentable_restaurant_id=2296 
    AND lmoc.lastname NOT LIKE '%Tour%' 
GROUP BY lmoc.lastname, LEFT(lmoc.firstname, 2), lmoc.phone, lmoc.email 
HAVING COUNT(*) > 1; 

+0

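This drops the phone-OR-email part of the criteria. Some duplicates have a matching phone and some have a matching email, but very few have both. We would also like to get the IDs of both duplicates so that we can merge them. – Zak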
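The objection in this comment can be reproduced on a tiny data set. The sketch below is an illustration only: the customer table and its three rows are invented, it runs on in-memory SQLite through Python, and SUBSTR stands in for LEFT. It shows that grouping on phone and email together finds nothing, while grouping on each contact column separately finds both groups, with GROUP_CONCAT returning the ids in each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (          -- hypothetical stand-in for the real table
    id INTEGER PRIMARY KEY,
    firstname TEXT, lastname TEXT, phone TEXT, email TEXT
);
INSERT INTO customer VALUES
    (1, 'Ann',  'Lee', '555-0101', 'ann@example.com'),
    (2, 'Anna', 'Lee', '555-0101', 'anna@example.com'),  -- shares only the phone with 1
    (3, 'Anne', 'Lee', '555-0177', 'ann@example.com');   -- shares only the email with 1
""")

def duplicate_groups(contact_column):
    """Return the id lists of groups that share one contact column."""
    # The column name is interpolated into the SQL text, so it must come
    # from the fixed set below and never from user input.
    assert contact_column in ("phone", "email")
    sql = f"""
        SELECT GROUP_CONCAT(id)
        FROM customer
        GROUP BY lastname, SUBSTR(firstname, 1, 2), {contact_column}
        HAVING COUNT(*) > 1
    """
    return [sorted(int(i) for i in row[0].split(","))
            for row in conn.execute(sql)]

# Grouping on phone AND email requires both to match, so nothing is found.
groups_on_both = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT 1 FROM customer
        GROUP BY lastname, SUBSTR(firstname, 1, 2), phone, email
        HAVING COUNT(*) > 1
    ) AS g
""").fetchone()[0]

by_phone = duplicate_groups("phone")
by_email = duplicate_groups("email")
print(groups_on_both, by_phone, by_email)
```

Running one GROUP BY per contact column keeps the single-pass cost of the GROUP BY approach while still covering the OR in the criteria. The two result sets can overlap, so a merge step would need to combine them.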

1

This SQL will help you find the duplicates:

SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email 
FROM loyalty_member_opentable_customer lmoc 
WHERE lmoc.opentable_restaurant_id=2296 
    AND lmoc.lastname NOT LIKE '%Tour%' 
    AND lmoc.lastname BETWEEN 'ha' AND 'i' 
GROUP BY lmoc.opentable_restaurant_id, lmoc.id, LEFT(lmoc.firstname,2), lmoc.lastname, lmoc.phone, lmoc.email 
HAVING COUNT(*) > 1  

If you have a primary key, you can easily keep the most recent row and purge the older ones with this SQL:

/* Keeps the row with the highest primary_id in each group and deletes the rest. */
DELETE 
     lmoc 
FROM loyalty_member_opentable_customer lmoc 
LEFT JOIN 
    (SELECT 
     MAX(lmoc.primary_id) AS id 
    FROM loyalty_member_opentable_customer lmoc 
    WHERE lmoc.opentable_restaurant_id=2296 
     AND lmoc.lastname NOT LIKE '%Tour%' 
     AND lmoc.lastname BETWEEN 'ha' AND 'i' 
    GROUP BY lmoc.opentable_restaurant_id, lmoc.id, LEFT(lmoc.firstname,2), lmoc.lastname, lmoc.phone, lmoc.email 
    ) nodup 
    ON lmoc.primary_id = nodup.id 
WHERE lmoc.opentable_restaurant_id=2296 
     AND lmoc.lastname NOT LIKE '%Tour%' 
     AND lmoc.lastname BETWEEN 'ha' AND 'i' 
     AND nodup.id IS NULL; 
+0

I don't see lmoc.lastname BETWEEN 'ha' AND 'i' anywhere? – ForguesR

+0

I just took the WHERE conditions from Zak's question. It looks like he edited it afterwards. –

1

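You are not creating a temporary table; you are using subqueries, and that is going to be slow with 13 million rows. Create a real temporary table holding all the data you need (SELECT INTO, or CREATE TABLE ... AS SELECT in MySQL, which the STRCMP calls suggest you are using).

This is what I would try: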

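/* Creating a work table. MySQL has no SELECT ... INTO <table>, so this uses
   CREATE TABLE ... AS SELECT. It is a regular table rather than a TEMPORARY
   one because MySQL cannot refer to a TEMPORARY table twice in one query. */ 
CREATE TABLE tempRestaurant AS 
SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email 
FROM loyalty_member_opentable_customer AS lmoc 
WHERE 
    lmoc.opentable_restaurant_id=2296 AND 
    lmoc.lastname NOT LIKE '%Tour%'; 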

/* Select duplicates */ 
SELECT * FROM 
    tempRestaurant AS t1 
INNER JOIN tempRestaurant AS t2 ON 
    STRCMP(t1.lastname,t2.lastname)=0 
    AND t1.id!=t2.id 
WHERE 
    STRCMP(LEFT(t1.firstname,2), LEFT(t2.firstname,2))=0 AND 
    (STRCMP(t1.phone,t2.phone)=0 OR STRCMP(t1.email,t2.email)=0)
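
A materialized table also makes it possible to add an index on the join column before running the self-join, which the subquery version cannot do. The following is a minimal sketch of that idea, not a measured solution: it runs on in-memory SQLite through Python, the sample rows and the index name idx_temp_lastname are invented for illustration, and = with SUBSTR replaces STRCMP and LEFT. It also uses t1.id < t2.id instead of t1.id != t2.id so that each pair is returned once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loyalty_member_opentable_customer (
    id INTEGER PRIMARY KEY, opentable_restaurant_id INTEGER,
    firstname TEXT, lastname TEXT, phone TEXT, email TEXT
);
INSERT INTO loyalty_member_opentable_customer VALUES   -- made-up sample rows
    (1, 2296, 'Carl',  'Diaz',       '555-0111', 'carl@example.com'),
    (2, 2296, 'Carla', 'Diaz',       '555-0111', 'cd@example.com'),
    (3, 9999, 'Carl',  'Diaz',       '555-0111', 'carl@example.com'),  -- other restaurant
    (4, 2296, 'Tom',   'Tour Group', '555-0122', 'tour@example.com'),  -- dropped by NOT LIKE
    (5, 2296, 'Tomas', 'Tour Group', '555-0122', 'tour@example.com');  -- dropped by NOT LIKE

-- Materialize the one restaurant's rows once.
CREATE TABLE tempRestaurant AS
SELECT id, firstname, lastname, phone, email
FROM loyalty_member_opentable_customer
WHERE opentable_restaurant_id = 2296 AND lastname NOT LIKE '%Tour%';

-- Hypothetical index name; it lets the join look up matching last names.
CREATE INDEX idx_temp_lastname ON tempRestaurant (lastname);
""")

kept = conn.execute("SELECT COUNT(*) FROM tempRestaurant").fetchone()[0]

rows = conn.execute("""
    SELECT t1.id, t2.id, t1.lastname
    FROM tempRestaurant AS t1
    JOIN tempRestaurant AS t2
      ON t1.lastname = t2.lastname
     AND t1.id < t2.id
    WHERE SUBSTR(t1.firstname, 1, 2) = SUBSTR(t2.firstname, 1, 2)
      AND (t1.phone = t2.phone OR t1.email = t2.email)
    ORDER BY t1.id, t2.id
""").fetchall()
print(kept, rows)
```

On MySQL the equivalent index would be created the same way, assuming lastname is a VARCHAR column; the query plan there should be checked with EXPLAIN before relying on it.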