2014-11-06 78 views
2

SQL Server: find and group duplicates by "weak" criteria

I am trying to list and group some potential duplicates from a Person table.

The schema looks like this:

Id LastName  OriginalName FirstName 
--------------------------------------------- 
1  Nolte   Huber   Silvia 
2  Nolte       Johann 
3  Huber       Milan 
4  Huber       Silvia 
5  Abacherli      Adrian 
6  Abächerli      Adrian  
7  Meier       Hans 
8  Meier       Urs 
9  Meyer       Hans 
10 Meier       Urs 
11 Hermann      Marco 
12 Huber       Milan 
13 Meyer       Hans  

Expected result:

GroupNumber Id LastName  OriginalName FirstName 
----------------------------------------------------------- 
1    5  Abacherli      Adrian 
1    6  Abächerli      Adrian 
2    3  Huber       Milan 
2    12 Huber       Milan 
3    4  Huber       Silvia 
3    1  Nolte   Huber   Silvia 
4    7  Meier       Hans 
4    9  Meyer       Hans 
4    13 Meyer       Hans 
5    8  Meier       Urs 
5    10 Meier       Urs 

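Explanation:

I want to group rows that are close matches and list them in a grid in a web application (ASP.NET MVC). To be considered a duplicate, rows must at least have:

  • the same LastName and the same FirstName, or
  • a LastName equal to the other row's OriginalName and the same FirstName

To make things more complicated, "same" means a phonetic match (i.e. via SOUNDEX or a similar function): Meyer == Meier == meier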
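The two rules and the phonetic notion of "same" can be sketched outside the database as follows. This is only an illustration in Python: `soundex` below is a hand-written, simplified Soundex, not SQL Server's SOUNDEX, and folding diacritics (ä to a) is an assumption made so that Abacherli and Abächerli compare equal. The helper `is_potential_duplicate` is a hypothetical name.

```python
import unicodedata

# Standard Soundex digit classes; vowels, H, W and Y carry no digit.
_CODES = {ch: digit
          for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                               "4": "L", "5": "MN", "6": "R"}.items()
          for ch in group}


def soundex(name):
    """Return a 4-character Soundex-style key, or "" for an empty name."""
    # Assumption: fold diacritics (ä -> a) and ignore anything outside A-Z.
    folded = unicodedata.normalize("NFKD", name or "").upper()
    letters = [ch for ch in folded if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal digits
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            key += digit
        previous = digit  # a vowel resets the run
    return (key + "000")[:4]


def is_potential_duplicate(a, b):
    """Apply the two rules from the question to two Person dicts."""
    if soundex(a["FirstName"]) != soundex(b["FirstName"]):
        return False
    last_a, last_b = soundex(a["LastName"]), soundex(b["LastName"])
    orig_a = soundex(a.get("OriginalName"))
    orig_b = soundex(b.get("OriginalName"))
    return (last_a == last_b
            or (orig_b != "" and last_a == orig_b)
            or (orig_a != "" and last_b == orig_a))
```

With this sketch, `soundex("Meyer")`, `soundex("Meier")` and `soundex("meier")` all yield `"M600"`, and row 1 (Nolte, originally Huber, Silvia) is flagged as a potential duplicate of row 4 (Huber, Silvia).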

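Technologies in use:

  • Microsoft SQL Server 2012
  • Telerik Data Access ORM
  • .NET Framework 4.5, C#

Expected answer:

  • a pure SQL query, or
  • a stored procedure, or
  • a combination of an SQL query / SP and a LINQ query against the ORM in C#

All the approaches I have worked out so far lack the GroupNumber. Here is one such (non-working) query: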

SELECT 
    Id, LastName, FirstName 
FROM 
    Person p1, 
    (SELECT 
    p1.Id AS Id1 
    FROM Person p1 
    INNER JOIN Person p2 
    ON (p1.LastName LIKE p2.LastName OR p1.LastName LIKE p2.OriginalName) AND p1.FirstName LIKE p2.FirstName AND p1.Id <> p2.Id 
    GROUP BY p1.Id 
    HAVING COUNT(*) > 1) AS p2 
WHERE 
    p1.Id IN (SELECT Id1) 
ORDER BY 
    p1.LastName, FirstName, Id 
+1

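[Bad habits to kick: using old-style JOINs](http://sqlblog.com/blogs/aaron_bertrand/archive/2009/10/08/bad-habits-to-kick-using-old-style-joins.aspx) - that old-style *comma-separated list of tables* style should **no longer be used**. Instead, it is recommended to use the **proper ANSI JOIN** syntax introduced with the ANSI-**92** SQL standard (**more than 20 years** ago) – 2014-11-06 11:32:13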
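As a small illustration of that comment, the sketch below runs the same self-join once with a comma-separated table list and once with an explicit INNER JOIN. It uses SQLite through Python's `sqlite3` module only so that it is self-contained; the three sample rows are taken from the question, and the exact-equality match is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (Id INTEGER, LastName TEXT, FirstName TEXT);
    INSERT INTO Person VALUES
        (8, 'Meier', 'Urs'), (10, 'Meier', 'Urs'), (11, 'Hermann', 'Marco');
""")

# Old style: tables separated by commas, join condition hidden in WHERE.
old_style = """
    SELECT p1.Id, p2.Id
    FROM Person p1, Person p2
    WHERE p1.LastName = p2.LastName
      AND p1.FirstName = p2.FirstName
      AND p1.Id < p2.Id
    ORDER BY p1.Id, p2.Id
"""

# ANSI-92 style: the join condition sits next to the join itself.
ansi_style = """
    SELECT p1.Id, p2.Id
    FROM Person AS p1
    INNER JOIN Person AS p2
        ON p1.LastName = p2.LastName
       AND p1.FirstName = p2.FirstName
       AND p1.Id < p2.Id
    ORDER BY p1.Id, p2.Id
"""

old_rows = conn.execute(old_style).fetchall()
ansi_rows = conn.execute(ansi_style).fetchall()
conn.close()
```

Both queries return the same pair of Ids here; the explicit form mainly makes it harder to forget a join condition.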

+0

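Have you thought about using Microsoft Data Quality Services (part of SQL Server) to identify duplicates and close matches? Could it output to a table that you then display in the grid on the web page? – 2014-11-06 12:35:25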

+0

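@SteveFord: I had a quick look at DQS. As far as finding and grouping duplicates goes, it would meet the requirements. But there seems to be no API (see [link](http://stackoverflow.com/questions/15293671/use-of-dqs-apis)), so I cannot integrate the results into the existing web application. – flo 2014-11-06 12:54:23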

Answers

1

How about:

SQL Fiddle

MS SQL Server 2012 Schema Setup

CREATE TABLE Person 
(ID Int, 
    LastName Varchar(50), 
    OriginalName Varchar(50), 
    FirstName varchar(50) 
) 

INSERT INTO Person 
VALUES 
    (1, 'Nolte', 'Huber','Silvia'), 
    (2,'Nolte', '', 'Johann'), 
    (3,'Huber', '', 'Milan'), 
    (4,'Huber', '', 'Silvia'), 
    (5,'Abacherli', '', 'Adrian'), 
    (6,'Abacherli', '', 'Adrian'), 
    (7,'Meier', '', 'Hans'), 
    (8,'Meier', '', 'Urs'), 
    (9,'Meyer', '', 'Hans'), 
    (10,'Meier', '', 'Urs'), 
    (11,'Hermann', '', 'Marco'), 
    (12,'Huber', '', 'Milan'), 
    (13,'Meyer', '', 'Hans') 

Query 1

;WITH PersonCTE 
AS 
(
    SELECT ID, SOUNDEX(LastName) AS LastNameSDX, LastName, OriginalName, SOUNDEX(FirstName) FirstNameSDX, FirstName 
    FROM Person 
    UNION ALL 
    SELECT ID, SOUNDEX(OriginalName) AS LastNameSDX, LastName, OriginalName, SOUNDEX(FirstName) FirstNameSDX, FirstName 
    FROM Person 
    WHERE OriginalName <> '' 
), 
PersonRankCTE 
AS 
(
    SELECT DENSE_RANK() OVER (ORDER BY LastNameSDX, FirstNameSdx) AS Grp, * 
    FROM PersonCTE 
) 
SELECT DENSE_RANK() OVER(ORDER BY grp) AS Grp, ID, LastName, OriginalName, FirstName 
FROM PersonRankCTE P1 
WHERE (SELECT COUNT(*) FROM PersonRankCTE P2 WHERE P1.grp = P2.grp) > 1 

Results

| GRP | ID | LASTNAME | ORIGINALNAME | FIRSTNAME | 
|-----|----|-----------|--------------|-----------| 
| 1 | 5 | Abacherli |    | Adrian | 
| 1 | 6 | Abacherli |    | Adrian | 
| 2 | 3 |  Huber |    |  Milan | 
| 2 | 12 |  Huber |    |  Milan | 
| 3 | 1 |  Nolte |  Huber | Silvia | 
| 3 | 4 |  Huber |    | Silvia | 
| 4 | 13 |  Meyer |    |  Hans | 
| 4 | 9 |  Meyer |    |  Hans | 
| 4 | 7 |  Meier |    |  Hans | 
| 5 | 8 |  Meier |    |  Urs | 
| 5 | 10 |  Meier |    |  Urs | 
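For readers without a SQL Server instance at hand, the same idea (one row per phonetic key, DENSE_RANK to number the keys, then a second DENSE_RANK over the surviving groups) can be tried on SQLite through Python's `sqlite3` module. This is a sketch under assumptions: the registered `SOUNDEX` is a simplified, hand-written stand-in for the SQL Server function, diacritics are folded so that row 6 can be stored as Abächerli as in the question, and SQLite 3.25 or later is needed for window functions.

```python
import sqlite3
import unicodedata

_CODES = {ch: d for d, grp in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                               "4": "L", "5": "MN", "6": "R"}.items() for ch in grp}


def soundex(name):
    """Simplified Soundex stand-in (assumption: diacritics are folded)."""
    folded = unicodedata.normalize("NFKD", name or "").upper()
    letters = [ch for ch in folded if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key, previous = letters[0], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal digits
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            key += digit
        previous = digit
    return (key + "000")[:4]


PEOPLE = [
    (1, "Nolte", "Huber", "Silvia"), (2, "Nolte", "", "Johann"),
    (3, "Huber", "", "Milan"), (4, "Huber", "", "Silvia"),
    (5, "Abacherli", "", "Adrian"), (6, "Abächerli", "", "Adrian"),
    (7, "Meier", "", "Hans"), (8, "Meier", "", "Urs"),
    (9, "Meyer", "", "Hans"), (10, "Meier", "", "Urs"),
    (11, "Hermann", "", "Marco"), (12, "Huber", "", "Milan"),
    (13, "Meyer", "", "Hans"),
]

QUERY = """
WITH PersonCTE AS (
    -- one row per person keyed by LastName ...
    SELECT ID, SOUNDEX(LastName) AS LastNameSDX, SOUNDEX(FirstName) AS FirstNameSDX
    FROM Person
    UNION ALL
    -- ... plus one row keyed by OriginalName where it is filled in
    SELECT ID, SOUNDEX(OriginalName), SOUNDEX(FirstName)
    FROM Person
    WHERE OriginalName <> ''
),
PersonRankCTE AS (
    SELECT DENSE_RANK() OVER (ORDER BY LastNameSDX, FirstNameSDX) AS Grp, ID
    FROM PersonCTE
)
SELECT DENSE_RANK() OVER (ORDER BY Grp) AS GroupNumber, ID
FROM PersonRankCTE AS P1
WHERE (SELECT COUNT(*) FROM PersonRankCTE AS P2 WHERE P2.Grp = P1.Grp) > 1
ORDER BY GroupNumber, ID
"""


def find_duplicate_groups(people):
    """Load the sample rows into an in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("SOUNDEX", 1, soundex)
    conn.execute("CREATE TABLE Person "
                 "(ID INTEGER, LastName TEXT, OriginalName TEXT, FirstName TEXT)")
    conn.executemany("INSERT INTO Person VALUES (?, ?, ?, ?)", people)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

Calling `find_duplicate_groups(PEOPLE)` yields the same five groups as the results table above, as `(GroupNumber, ID)` pairs.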
0

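Maybe overcomplicated (probably?), but...

I build two CTEs:

1 to get all the Person fields together with the Soundex codes of LastName and OriginalName

1 to create the groups and get the GroupNumber. It uses a UNION ALL to put the soundexed LastName and OriginalName into a single "column" (keeping only the duplicated ones)

So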

with cte as (select 
        id, 
        LastName, 
        OriginalName, 
        soundex(LastName) as sdxLastName, 
        soundex(nullif(OriginalName, '')) as sdxOriginalName, -- an empty OriginalName becomes NULL and is skipped below
        FirstName 
      from Person), 

    grp as (select lname, FirstName, row_number() over(order by lname) rn 
      from (
        select 
        sdxOriginalName as lname, 
        FirstName from cte 
        where sdxOriginalName is not null 
        union all 
        select 
         sdxLastName as lname, 
         FirstName from cte) s 
       group by lname, FirstName 
       having count(*) > 1) 
select 
    g.rn as GroupNumber, 
    p.Id, 
    p.LastName, 
    p.OriginalName, 
    p.FirstName 
from grp g 
join cte p on p.firstName = g.FirstName and 
    (sdxLastName = g.lname or sdxOriginalName = g.lname) 
order by rn 

See the SQL Fiddle
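If the grouping is done in the application tier instead (the question allows a combination of SQL and ORM code), the second step can be sketched as below. It assumes the rows of the first CTE have already been loaded, with `None` for a missing OriginalName code; the Soundex codes in `ROWS` are standard values written by hand for a few rows of the question, and `group_duplicates` is a hypothetical helper. As in the query above, FirstName is compared exactly rather than phonetically.

```python
from collections import defaultdict

# (Id, LastName, OriginalName, sdxLastName, sdxOriginalName, FirstName),
# shaped like the output of the first CTE; codes are hand-written.
ROWS = [
    (1, "Nolte", "Huber", "N430", "H160", "Silvia"),
    (2, "Nolte", None, "N430", None, "Johann"),
    (4, "Huber", None, "H160", None, "Silvia"),
    (7, "Meier", None, "M600", None, "Hans"),
    (9, "Meyer", None, "M600", None, "Hans"),
    (11, "Hermann", None, "H655", None, "Marco"),
]


def group_duplicates(rows):
    """Return (GroupNumber, Id) pairs for keys shared by several persons."""
    buckets = defaultdict(set)
    for person_id, _last, _orig, sdx_last, sdx_orig, first in rows:
        # The UNION ALL step: both codes feed the same "column".
        for code in (sdx_last, sdx_orig):
            if code is not None:
                buckets[(code, first)].add(person_id)

    result = []
    group_number = 0
    for key in sorted(buckets):
        ids = buckets[key]
        # Unlike HAVING COUNT(*) > 1, this counts distinct persons, so one
        # person whose two codes coincide does not form a group alone.
        if len(ids) > 1:
            group_number += 1
            result.extend((group_number, pid) for pid in sorted(ids))
    return result
```

For the sample rows, `group_duplicates(ROWS)` puts Ids 1 and 4 into group 1 and Ids 7 and 9 into group 2; the remaining rows have no partner and are dropped.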