2012-10-23

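Deduplicating imported records in SQL Server

I have the following T-SQL stored procedure, which accounts for 50% of the total time needed to run all of the processes that push newly imported records into our back-end analysis suite. Unfortunately, this data has to be imported every time, and the procedure is becoming a bottleneck as the database grows.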

Basically, we are trying to identify all duplicates among the records and keep only one of each.

-- @reference is not declared in this excerpt; presumably it is a parameter of the stored procedure.
DECLARE @status INT
SET @status = 3  -- status given to rows flagged as duplicates

DECLARE @contactid INT
DECLARE @email VARCHAR (100)

-- Contacts: one cursor row per email that occurs more than once for this reference
DECLARE email_cursor CURSOR FOR
SELECT email FROM contacts WHERE (reference = @reference AND status = 1) GROUP BY email HAVING (COUNT(email) > 1)
OPEN email_cursor

FETCH NEXT FROM email_cursor INTO @email

WHILE @@FETCH_STATUS = 0
    BEGIN
     PRINT @email
     -- Flag every active row in the group as a duplicate...
     UPDATE contacts SET duplicate = 1, status = @status WHERE email = @email and reference = @reference AND status = 1
     -- ...then pick one flagged row and restore it as the surviving record
     SELECT TOP 1 @contactid = id FROM contacts where reference = @reference and email = @email AND duplicate = 1
     UPDATE contacts SET duplicate =0, status = 1 WHERE id = @contactid
     FETCH NEXT FROM email_cursor INTO @email
    END

CLOSE email_cursor
DEALLOCATE email_cursor

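I have added all the indexes I could identify from the query execution plans, but it may be possible to rewrite the whole SP to work in a different way, as I have managed to do with others.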

Answers

Use this single query to do the deduplication.

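;with tmp as (
    select *
          ,rn = row_number() over (partition by email, reference order by id)  -- position within the (email, reference) group
          ,c  = count(1) over (partition by email, reference)                  -- number of rows in the group
    from contacts
    where status = 1
)
update tmp
   set duplicate = case when rn = 1 then 0 else 1 end  -- the lowest id in each group is kept
      ,status    = case when rn = 1 then 1 else 3 end
 where c > 1  -- only touch groups that actually contain duplicates
;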

It only deduplicates among records where status = 1, and treats rows with the same (email, reference) combination as duplicates of one another.
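If only a single import batch should be processed, as the cursor version does with @reference, the same pattern can be narrowed with an extra filter. The following is a rough sketch rather than a tested drop-in: the INT type and the value given to @reference are assumptions made for illustration, since the question does not show how it is declared. It also uses COUNT(email) instead of count(1), on the assumption that rows with a NULL email should be left alone, as in the original GROUP BY ... HAVING COUNT(email) > 1 query.

```sql
-- Assumption: @reference is the procedure's parameter; its type and value
-- here are hypothetical and only make the sketch self-contained.
DECLARE @reference INT = 1;

WITH tmp AS (
    SELECT duplicate,
           status,
           rn = ROW_NUMBER() OVER (PARTITION BY email ORDER BY id),
           c  = COUNT(email) OVER (PARTITION BY email)  -- 0 for the NULL-email group
    FROM contacts
    WHERE status = 1
      AND reference = @reference                         -- limit to one batch
)
UPDATE tmp
   SET duplicate = CASE WHEN rn = 1 THEN 0 ELSE 1 END,   -- lowest id survives
       status    = CASE WHEN rn = 1 THEN 1 ELSE 3 END
 WHERE c > 1;

-- Sanity check: this should return no rows once the update has run.
SELECT email, COUNT(email) AS remaining
FROM contacts
WHERE reference = @reference
  AND status = 1
GROUP BY email
HAVING COUNT(email) > 1;
```

Because reference is fixed by the WHERE clause, partitioning by email alone is enough here. Unlike the cursor's TOP 1 without an ORDER BY, ordering by id makes the surviving row deterministic.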

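One minor issue with the above: it was missing the FROM clause, which I believe is required. With that added, it works perfectly. All the unit tests are green and the time taken has dropped by 90%. – ChrisBint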

@Chris Thanks for the correction! I have incorporated it into the answer. – RichardTheKiwi