2015-01-13 91 views
6

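Remove duplicates, keeping the row with the fewest NULL values

I have an employee table with about 25 columns. It currently contains a lot of duplicates, and I would like to get rid of some of them.

First, I want to find the duplicates by looking for multiple records that have the same values for first name, last name, employee number, company number, and status.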

SELECT 
    firstname,lastname,employeenumber, companynumber, statusflag 
FROM 
    employeemaster 
GROUP BY 
    firstname,lastname,employeenumber,companynumber, statusflag 
HAVING 
    (COUNT(*) > 1) 

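This gives me the duplicates, but my goal is to find and keep the single best record in each group and delete the others. The "single best record" is defined as the record with the fewest NULL values across all the other columns. How can I do this?

I am using Microsoft SQL Server 2012 Management Studio.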

Example:

[image: example rows]

Red: DELETE, green: KEEP

Note: the actual table has more columns than are shown here.

+0

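Could you edit the question to include sample records and the expected output? –

+0

Sure, I added an example. – user3788671

+2

A very crude approach, but could you sum the length of the text in each column and sort by that? Or sum an IsNull()? – Liath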

Answers

2

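You can use the sys.columns catalog view to get the list of columns and build a dynamic query. Based on your criteria, this query returns a "KeepThese" value of 1 for each record you want to keep; higher values mark the duplicates with fewer non-NULL columns.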

-- insert test data 
create table EmployeeMaster 
    (
    Record int identity(1,1), 
    FirstName varchar(50), 
    LastName varchar(50), 
    EmployeeNumber int, 
    CompanyNumber int, 
    StatusFlag int, 
    UserName varchar(50), 
    Branch varchar(50) 
); 
insert into EmployeeMaster 
    (
    FirstName, 
    LastName, 
    EmployeeNumber, 
    CompanyNumber, 
    StatusFlag, 
    UserName, 
    Branch 
) 
    values 
    ('Jake','Jones',1234,1,1,'JJONES','PHX'), 
    ('Jake','Jones',1234,1,1,NULL,'PHX'), 
    ('Jake','Jones',1234,1,1,NULL,NULL), 
    ('Jane','Jones',5678,1,1,'JJONES2',NULL); 

-- get records with most non-null values with dynamic sys.column query 
declare @sql varchar(max) 
select @sql = ' 
    select e.*, 
     row_number() over(partition by 
          e.FirstName, 
          e.LastName, 
          e.EmployeeNumber, 
          e.CompanyNumber, 
          e.StatusFlag 
          order by n.NonNullCnt desc) as KeepThese 
    from EmployeeMaster e 
     cross apply (select count(n.value) as NonNullCnt from (select ' + 
      replace((
       select 'cast(' + c.name + ' as varchar(50)) as value union all select ' 
       from sys.columns c 
       where c.object_id = t.object_id 
       for xml path('') 
       ) + '#',' union all select #','') + ')n)n' 
from sys.tables t 
where t.name = 'EmployeeMaster' 

exec(@sql) 
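
To make the dynamic part easier to follow, here is a hand-expanded sketch of roughly what @sql contains for the test table above. The formatting and the inner alias v are mine, and the column order depends on what sys.columns returns, so treat it as an illustration rather than the exact generated text (print @sql shows the real string).

```sql
-- Hand-written approximation of the generated query for the test table.
-- Every column becomes one row of the inner derived table; COUNT(v.value)
-- then counts only the rows whose value is not NULL.
select e.*,
       row_number() over(partition by
                             e.FirstName,
                             e.LastName,
                             e.EmployeeNumber,
                             e.CompanyNumber,
                             e.StatusFlag
                         order by n.NonNullCnt desc) as KeepThese
from EmployeeMaster e
cross apply (
    select count(v.value) as NonNullCnt
    from (
        select cast(Record as varchar(50)) as value
        union all select cast(FirstName as varchar(50)) as value
        union all select cast(LastName as varchar(50)) as value
        union all select cast(EmployeeNumber as varchar(50)) as value
        union all select cast(CompanyNumber as varchar(50)) as value
        union all select cast(StatusFlag as varchar(50)) as value
        union all select cast(UserName as varchar(50)) as value
        union all select cast(Branch as varchar(50)) as value
    ) v
) n;
```

The key columns and Record are never NULL in the test data, so they add the same amount to every row in a group; only UserName and Branch change the ranking.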
+0

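Amazing! :) But could you explain where it filters so that 'count(n.value)' only counts the non-null column values? At first I thought that was done by the 'cross apply', but it seems to be done by 'count', which only considers 'not null' values. Am I right? – pkuderov

+1

Yes, you are right. Whenever you count a column, only the non-null values are counted: select count(value) from (select 1 as value union all select null as value) v –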
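
A slightly expanded version of that one-liner, as a minimal sketch, puts COUNT(*) next to COUNT(column) so the difference is visible in one result row. The column aliases AllRows and NonNullRows are made up for the illustration.

```sql
-- Two input rows, one of them NULL.
select count(*)       as AllRows,      -- counts every row
       count(v.value) as NonNullRows   -- skips rows where value is NULL
from (select 1 as value
      union all
      select null as value) v;
```

This returns AllRows = 2 and NonNullRows = 1. SQL Server may also print a warning that a NULL value was eliminated by an aggregate, which is expected here.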

1

Try this.

;WITH cte 
    AS (SELECT Row_number() 
        OVER(
        partition BY firstname, lastname, employeenumber, companynumber, statusflag 
        ORDER BY (SELECT NULL)) rn, 
       firstname, 
       lastname, 
       employeenumber, 
       companynumber, 
       statusflag, 
       username, 
       branch 
     FROM employeemaster), 
    cte1 
    AS (SELECT a.firstname, 
       a.lastname, 
       a.employeenumber, 
       a.companynumber, 
       a.statusflag, 
       Row_number() 
        OVER(
        partition BY a.firstname, a.lastname, a.employeenumber, a.companynumber, a.statusflag 
        ORDER BY (CASE WHEN a.username IS NULL THEN 1 ELSE 0 END +CASE WHEN a.branch IS NULL THEN 1 ELSE 0 END))rn 
         -- add the remaining columns in case statement 
     FROM cte a 
       JOIN employeemaster b 
        ON a.firstname = b.firstname 
        AND a.lastname = b.lastname 
        AND a.employeenumber = b.employeenumber 
        AND a.companynumber = b.companynumber 
        AND a.statusflag = b.statusflag) 
SELECT * 
FROM cte1 
WHERE rn = 1 
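
The query above only selects the rows to keep. A possible way to turn the same ranking into the delete the question asks for is to delete through a CTE, which SQL Server allows when the CTE reads from a single table. This is a sketch under that assumption; the CTE name ranked is mine, and only the two extra columns from the test data are covered.

```sql
-- Sketch: rank each duplicate group by its number of NULLs and delete
-- everything except the first row. Ties are broken arbitrarily.
;WITH ranked
    AS (SELECT Row_number()
                 OVER(
                   partition BY firstname, lastname, employeenumber, companynumber, statusflag
                   ORDER BY CASE WHEN username IS NULL THEN 1 ELSE 0 END
                          + CASE WHEN branch IS NULL THEN 1 ELSE 0 END
                          -- add the remaining columns as further CASE terms
                 ) AS rn
        FROM employeemaster)
DELETE FROM ranked
WHERE rn > 1
```

It is worth running it inside a transaction first, or replacing the DELETE with SELECT * to preview the affected rows, before removing anything for real.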
1

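I tested this with MySQL and used string concatenation with NULL to pick the best record. LENGTH(NULL || 'data') is NULL, because concatenating anything with NULL yields NULL, so a length exists only when none of the columns is NULL. (In MySQL this assumes || means concatenation, which requires the PIPES_AS_CONCAT SQL mode; otherwise use CONCAT().) Maybe this is not perfect.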

create table EmployeeMaster 
    (
    Record int auto_increment, 
    FirstName varchar(50), 
    LastName varchar(50), 
    EmployeeNumber int, 
    CompanyNumber int, 
    StatusFlag int, 
    UserName varchar(50), 
    Branch varchar(50), 

    PRIMARY KEY(record) 
); 
INSERT INTO EmployeeMaster 
    (
    FirstName, LastName, EmployeeNumber, CompanyNumber, StatusFlag, UserName, Branch 
) VALUES ('Jake', 'Jones', 1234, 1, 1, 'JJONES', 'PHX'), ('Jake', 'Jones', 1234, 1, 1, NULL, 'PHX'), ('Jake', 'Jones', 1234, 1, 1, NULL, NULL), ('Jane', 'Jones', 5678, 1, 1, 'JJONES2', NULL); 

My idea for the query looks like this:

SELECT e.* 
    FROM employeemaster e 
    JOIN (SELECT firstname, 
       lastname, 
       employeenumber, 
       companynumber, 
       statusflag, 
       MAX(LENGTH (username || branch)) data_quality 
      FROM employeemaster 
     GROUP BY firstname, lastname, employeenumber, companynumber, statusflag 
     HAVING count(*) > 1 
     ) g 
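    ON e.firstname = g.firstname
    AND e.lastname = g.lastname
    AND e.employeenumber = g.employeenumber
    AND e.companynumber = g.companynumber
    AND e.statusflag = g.statusflag
    AND LENGTH (e.username || e.branch) = g.data_quality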
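
Because the concatenation trick can only tell "no NULLs at all" apart from "at least one NULL", a per-column count may rank partially filled rows better. The following is a minimal MySQL sketch of that idea, not part of the original answer: it relies on MySQL evaluating IS NOT NULL to 1 or 0, and the alias non_null_cnt is made up.

```sql
-- Sketch (MySQL): count the filled-in columns per row and list each
-- duplicate group with its most complete row first.
SELECT e.*,
       (e.UserName IS NOT NULL) + (e.Branch IS NOT NULL) AS non_null_cnt
  FROM EmployeeMaster e
 ORDER BY e.FirstName, e.LastName, e.EmployeeNumber,
          e.CompanyNumber, e.StatusFlag,
          non_null_cnt DESC;
```

For a wider table, each additional column would contribute one more (column IS NOT NULL) term.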