2013-09-30
1

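How to detect duplicate records that include child-table records

Say I create an address book in which the main table holds the basic contact information and a child table holds the phone numbers: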

Contact 
=============== 
Id   [PK] 
Name 

PhoneNumber 
=============== 
Id   [PK] 
Contact_Id [FK] 
Number 

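So a Contact record may have zero or more related records in the PhoneNumber table. There is no uniqueness constraint on any column other than the primary keys. In fact, this has to be the case, because:

  1. Two contacts with different names may share a phone number, and
  2. Two contacts may have the same name but different phone numbers.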

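I want to import a large dataset that may contain duplicate records into my database and then filter out the duplicates using SQL. The rule for identifying duplicates is simple: two records are duplicates if they have the same name and the same number of phone records with the same content.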
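The rule above can be sketched outside SQL to make it concrete. This is only an illustration, not the approach taken in the answers: it assumes an in-memory SQLite database driven from Python, made-up sample names ('Ann', 'Bob', 'Cy'), and a hypothetical helper `find_duplicate_groups` that gives each contact a signature made of its name plus its sorted phone numbers.

```python
import sqlite3
from collections import defaultdict


def find_duplicate_groups(conn):
    """Return lists of Contact ids that share a name and the same phone numbers."""
    rows = conn.execute(
        "SELECT c.Id, c.Name, p.Number "
        "FROM Contact c LEFT JOIN PhoneNumber p ON p.Contact_Id = c.Id"
    ).fetchall()

    names = {}
    numbers = defaultdict(list)
    for contact_id, name, number in rows:
        names[contact_id] = name
        # Assumption: Number is never NULL, so None only marks "no phone rows".
        if number is not None:
            numbers[contact_id].append(number)

    # Contacts with an identical (name, sorted numbers) signature are duplicates.
    groups = defaultdict(list)
    for contact_id, name in names.items():
        signature = (name, tuple(sorted(numbers[contact_id])))
        groups[signature].append(contact_id)

    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Contact (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE PhoneNumber ("
    "Id INTEGER PRIMARY KEY, "
    "Contact_Id INTEGER REFERENCES Contact (Id), "
    "Number TEXT)"
)
# Hypothetical sample data: contacts 1 and 2 match, 3 has fewer numbers,
# 4 has a different name, 5 and 6 share a name and have no numbers at all.
conn.executemany(
    "INSERT INTO Contact (Id, Name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Ann"), (3, "Ann"), (4, "Bob"), (5, "Cy"), (6, "Cy")],
)
conn.executemany(
    "INSERT INTO PhoneNumber (Contact_Id, Number) VALUES (?, ?)",
    [(1, "111"), (1, "222"), (2, "222"), (2, "111"), (3, "111"), (4, "111")],
)

duplicate_groups = find_duplicate_groups(conn)
print(duplicate_groups)
```

Sorting the numbers makes the comparison independent of insertion order, and keeping them in a tuple rather than a set means a number stored twice for one contact still counts twice. Whether two same-named contacts with no phone numbers should count as duplicates is a choice; this sketch treats them as duplicates.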

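Of course, this works quite well for selecting duplicates from the Contact table alone, but it does not help me find the actual duplicates under my rule: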

SELECT * FROM Contact 
WHERE EXISTS 
    (SELECT 'x' FROM Contact t2 
    WHERE t2.Name = Contact.Name AND 
      t2.Id > Contact.Id); 

It seems that what I want is a logical extension of what I already have, but I must be overlooking it. Any help?

Thanks!


You need to join the two tables, group by name, and then use a 'HAVING' clause with 'COUNT(Id) > 1'. – mrtig


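There should be a unique constraint on '(PhoneNumber.Contact_Id, PhoneNumber.Number)', though. Otherwise you risk storing the same number more than once for the same contact ID (which, incidentally, could make it harder to identify duplicate *sets* of data when you import that large dataset). –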


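Andriy's comment is a good one. However, if the utility is meant to ingest data with minimal validation and clean it up later, it may be better to create a set of staging tables without that constraint and, as he says, enforce it on the final tables. – 240DL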
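The unique constraint suggested in the comments above could look like the following. This is a minimal sketch assuming SQLite driven from Python; the helper `try_insert` and the sample values are made up for illustration and are not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (
    Id   INTEGER PRIMARY KEY,
    Name TEXT
);
CREATE TABLE PhoneNumber (
    Id         INTEGER PRIMARY KEY,
    Contact_Id INTEGER REFERENCES Contact (Id),
    Number     TEXT,
    UNIQUE (Contact_Id, Number)  -- one row per (contact, number) pair
);
""")
conn.execute("INSERT INTO Contact (Id, Name) VALUES (1, 'Ann')")
conn.execute("INSERT INTO PhoneNumber (Contact_Id, Number) VALUES (1, '111')")


def try_insert(contact_id, number):
    """Hypothetical helper: report whether the row was accepted."""
    try:
        conn.execute(
            "INSERT INTO PhoneNumber (Contact_Id, Number) VALUES (?, ?)",
            (contact_id, number),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when the UNIQUE (Contact_Id, Number) constraint is violated.
        return False


duplicate_accepted = try_insert(1, "111")  # same contact, same number
other_accepted = try_insert(1, "222")      # same contact, new number
print(duplicate_accepted, other_accepted)
```

With a staging-table approach, the constraint would be left off the staging copy and declared only on the final table, so that the raw import never fails on repeated numbers.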

Answers

0

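The author states that for "two people to be the same person" they must:

  1. have the same name, and
  2. have the same number of phone numbers, all of which are the same.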

So the problem is a bit more complex than it looks (or maybe I just overthought it).

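Below are sample data and a sample query (an ugly one, I know, but the general idea is there). I tested the query against this test data and it seems to work correctly (I am using Oracle 11g R2):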

CREATE TABLE contact (
    id NUMBER PRIMARY KEY, 
    name VARCHAR2(40)) 
; 

CREATE TABLE phone_number (
    id NUMBER PRIMARY KEY, 
    contact_id REFERENCES contact (id), 
    phone VARCHAR2(10) 
); 

INSERT INTO contact (id, name) VALUES (1, 'John'); 
INSERT INTO contact (id, name) VALUES (2, 'John'); 
INSERT INTO contact (id, name) VALUES (3, 'Peter'); 
INSERT INTO contact (id, name) VALUES (4, 'Peter'); 
INSERT INTO contact (id, name) VALUES (5, 'Mike'); 
INSERT INTO contact (id, name) VALUES (6, 'Mike'); 
INSERT INTO contact (id, name) VALUES (7, 'Mike'); 

INSERT INTO phone_number (id, contact_id, phone) VALUES (1, 1, '123'); -- John having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (2, 1, '456'); -- John having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (3, 2, '123'); -- John the second having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (4, 2, '456'); -- John the second having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (5, 3, '123'); -- Peter having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (6, 3, '456'); -- Peter having number 456 
INSERT INTO phone_number (id, contact_id, phone) VALUES (7, 3, '789'); -- Peter having number 789 

INSERT INTO phone_number (id, contact_id, phone) VALUES (8, 4, '456'); -- Peter the second having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (9, 5, '123'); -- Mike having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (10, 5, '456'); -- Mike having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (11, 6, '123'); -- Mike the second having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (12, 6, '789'); -- Mike the second having number 789 

-- Mike the third having no number 
COMMIT; 

-- does not meet the requirements described in the question - will return Peter when it should not 
SELECT DISTINCT c.name 
    FROM contact c JOIN phone_number pn ON (pn.contact_id = c.id) 
GROUP BY c.name, pn.phone 
HAVING COUNT(c.id) > 1 
; 

-- returns correct results for provided test data 
-- take all people that have a namesake in contact table and 
-- take all this person's phone numbers that this person's namesake also has 
-- finally (outer query) check that the number of both persons' phone numbers is the same and 
-- the number of the same phone numbers is equal to the number of (either) person's phone numbers 
SELECT c1_id, name 
    FROM (
    SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt 
     FROM contact c1 
     JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name) 
     JOIN phone_number pn ON (pn.contact_id = c1.id) 
    WHERE 
     EXISTS (SELECT 1 
       FROM phone_number 
       WHERE contact_id = c2.id 
       AND phone = pn.phone) 
    GROUP BY c1.id, c1.name, c2.id 
) 
WHERE cnt = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) 
    AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id) 
; 

-- cleanup 
DROP TABLE phone_number; 
DROP TABLE contact; 

Check it on SQL Fiddle: http://www.sqlfiddle.com/#!4/36cdf/1

Edit

In response to the author's comment: indeed, I did not take that case into account. Here is a revised solution:

-- new test data 
INSERT INTO contact (id, name) VALUES (8, 'Jane'); 
INSERT INTO contact (id, name) VALUES (9, 'Jane'); 

SELECT c1_id, name 
    FROM (
    SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt 
     FROM contact c1 
     JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name) 
     LEFT JOIN phone_number pn ON (pn.contact_id = c1.id) 
    WHERE pn.contact_id IS NULL 
     OR EXISTS (SELECT 1 
       FROM phone_number 
       WHERE contact_id = c2.id 
       AND phone = pn.phone) 
    GROUP BY c1.id, c1.name, c2.id 
) 
WHERE (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) IN (0, cnt) 
    AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id) 
; 

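We now allow for the case where a contact has no phone numbers (the LEFT JOIN), and in the outer query we compare the person's phone number count: it must equal either 0 or the count returned by the inner query.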


Thanks! I think this is the right track. But what about the case where the contact records have zero phone records? – 240DL


Zero related records means there are no duplicate records. Aren't you looking for duplicates? –


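Well, they share the same name and have the same number of phone numbers... If the commenter means "will this query return people who have the same name and no phone numbers at all?", then: the first version of the query will not return them; the revised query will. –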

0

The keyword "HAVING" is your friend. The general usage is:

select field1, field2, count(*) records 
from whereever 
where whatever 
group by field1, field2 
having records > 1 

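Whether you can use an alias in the HAVING clause depends on the database engine. You should be able to apply this basic principle to your situation.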
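As a hedged illustration of the pattern above, the sketch below repeats the aggregate in the HAVING clause instead of relying on the alias, which avoids the engine-specific question entirely. It assumes SQLite driven from Python and made-up sample names; note that it only finds repeated names, so it is a starting point rather than the full duplicate rule from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Contact (Id INTEGER PRIMARY KEY, Name TEXT)")
# Hypothetical sample data: 'Ann' appears twice, 'Bob' once.
conn.executemany(
    "INSERT INTO Contact (Id, Name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Ann"), (3, "Bob")],
)

# Repeating COUNT(*) in HAVING is accepted by engines that reject the alias.
repeated_names = conn.execute(
    "SELECT Name, COUNT(*) AS records "
    "FROM Contact "
    "GROUP BY Name "
    "HAVING COUNT(*) > 1 "
    "ORDER BY Name"
).fetchall()
print(repeated_names)
```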

1

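In my question I created a greatly simplified schema that mirrors the real-world problem I am solving. Przemyslaw's answer is indeed correct and does what I asked, both for the sample schema and for the extended one.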

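However, after some experiments with the real schema and a larger (~10k records) dataset, I found that performance was a problem. I don't claim to be an indexing guru, but I could not find a better combination of indexes than the ones already in the schema.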

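So I came up with an alternative solution that meets the same requirements but executes in a small fraction (< 10%) of the time, at least with SQLite3, my production engine. In the hope that it helps someone else, I am offering it as an alternative answer to my own question.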

DROP TABLE IF EXISTS Contact; 
DROP TABLE IF EXISTS PhoneNumber; 

CREATE TABLE Contact (
    Id INTEGER PRIMARY KEY, 
    Name TEXT 
); 

CREATE TABLE PhoneNumber (
    Id   INTEGER PRIMARY KEY, 
    Contact_Id INTEGER REFERENCES Contact (Id) ON UPDATE CASCADE ON DELETE CASCADE, 
    Number  TEXT 
); 

INSERT INTO Contact (Id, Name) VALUES 
    (1, 'John Smith'), 
    (2, 'John Smith'), 
    (3, 'John Smith'), 
    (4, 'Jane Smith'), 
    (5, 'Bob Smith'), 
    (6, 'Bob Smith'); 

INSERT INTO PhoneNumber (Id, Contact_Id, Number) VALUES 
    (1, 1, '555-1212'), 
    (2, 1, '222-1515'), 
    (3, 2, '222-1515'), 
    (4, 2, '555-1212'), 
    (5, 3, '111-2525'), 
    (6, 4, '111-2525'); 

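-- No explicit COMMIT is needed here: SQLite autocommits each statement unless a transaction was opened with BEGIN. 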

SELECT * 
FROM Contact c1 
WHERE EXISTS (
    SELECT 1 
    FROM Contact c2 
    WHERE c2.Id > c1.Id 
    AND c2.Name = c1.Name 
    AND (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c2.Id) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id) 
    AND (
     SELECT COUNT(*) 
     FROM PhoneNumber p1 
     WHERE p1.Contact_Id = c2.Id 
     AND EXISTS (
      SELECT 1 
      FROM PhoneNumber p2 
      WHERE p2.Contact_Id = c1.Id 
      AND p2.Number = p1.Number 
     ) 
    ) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id) 
) 
; 

The results are as expected:

Id  Name 
====== ============= 
1  John Smith 
5  Bob Smith 

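Other engines will certainly perform differently, and the first solution's performance may be perfectly acceptable there. This solution seems to work well for SQLite with this schema.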