2009-08-17 73 views
2

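T-SQL query to flag repeat records

I have a table with more than 500,000 records. Each record has a LineNumber field, which is not unique and is not part of the primary key. Each record also has a CreatedOn field.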

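I need to update all 500,000+ records to identify the repeats.

A repeat is defined as a record that has the same LineNumber as another record created within the seven days before its own CreatedOn.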

[Image: http://i30.tinypic.com/27xq7oz.jpg]

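In the image above, row 4 is a repeat because it occurred only 5 days after row 1. Row 6 is not a repeat, even though it occurred only 4 days after row 4: row 4 is itself already a repeat, so row 6 can only be compared with row 1, which is 9 days before row 6. Row 6 is therefore not a repeat.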
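
The rule described above can be written down as a sequential scan, which may be useful as a reference for checking the output of any set-based query. This is only a sketch, in Python rather than T-SQL. The function name `mark_repeats`, the tuple layout and the three sample dates are made up for illustration (the dates merely mimic rows 1, 4 and 6 of the image), and the boundary is assumed to be strict, so a record exactly seven days after the last non-repeat would still count as a repeat.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # assumed window, taken from "within seven days"

def mark_repeats(rows):
    """rows: iterable of (job_id, line_number, created_on).
    Returns {job_id: is_repeat}; rows whose line_number is None are skipped."""
    last_non_repeat = {}  # line_number -> created_on of the latest non-repeat
    result = {}
    usable = [r for r in rows if r[1] is not None]
    # Visit rows in CreatedOn order, breaking ties by JobId.
    for job_id, line, created in sorted(usable, key=lambda r: (r[2], r[0])):
        anchor = last_non_repeat.get(line)
        if anchor is not None and created <= anchor + WINDOW:
            result[job_id] = True           # inside the window of a non-repeat
        else:
            result[job_id] = False          # starts a new window
            last_non_repeat[line] = created
    return result

# Hypothetical dates: row 4 is 5 days after row 1, row 6 is 9 days after row 1.
sample = [
    (1, "A", datetime(2009, 7, 1)),
    (4, "A", datetime(2009, 7, 6)),
    (6, "A", datetime(2009, 7, 10)),
]
flags = mark_repeats(sample)
print(flags)
```

A repeat never moves the anchor, which is why row 6 is compared with row 1 and not with row 4.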

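I don't know how to update the IsRepeat field short of stepping through each record one by one with a cursor or something similar.

I don't think a cursor is the way to go, but I'm stuck for any other possible solution.

I have wondered whether Common Table Expressions might help, but I have no experience with them and don't know where to start.

Essentially this same process needs to run against the table every day, because the table is truncated and repopulated daily. Once it has been repopulated, I have to re-flag every record as a repeat or not.

Any assistance would be appreciated.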

UPDATE

Here is a script to create the table and insert test data:

USE [Test] 
GO 

/****** Object: Table [dbo].[Job] Script Date: 08/18/2009 07:55:25 ******/ 
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U')) 
DROP TABLE [dbo].[Job] 
GO 

USE [Test] 
GO 

/****** Object: Table [dbo].[Job] Script Date: 08/18/2009 07:55:25 ******/ 
SET ANSI_NULLS ON 
GO 

SET QUOTED_IDENTIFIER ON 
GO 

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U')) 
BEGIN 
CREATE TABLE [dbo].[Job](
    [JobID] [int] IDENTITY(1,1) NOT NULL, 
    [LineNumber] [nvarchar](20) NULL, 
    [IsRepeat] [bit] NULL, 
    [CreatedOn] [smalldatetime] NOT NULL, 
CONSTRAINT [PK_Job] PRIMARY KEY CLUSTERED 
(
    [JobID] ASC 
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] 
) ON [PRIMARY] 
END 
GO 


SET NOCOUNT ON 

INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-01 07:52:08') 
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-01 08:30:01') 
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-01 09:30:35') 
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-01 10:51:10') 
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-02 09:22:30') 
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-02 10:27:28') 
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-02 11:15:33') 
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-02 13:01:13') 
INSERT INTO dbo.Job VALUES ('1014',NULL,'2009-07-03 12:05:56') 
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-03 13:57:34') 
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-03 15:38:54') 
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-04 16:32:20') 
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-05 13:46:46') 
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-05 15:08:35') 
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-05 15:19:50') 
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-05 16:37:19') 
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-05 17:14:09') 
INSERT INTO dbo.Job VALUES ('1009',NULL,'2009-07-05 20:55:08') 
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-06 08:29:29') 
INSERT INTO dbo.Job VALUES ('1002',NULL,'2009-07-07 11:22:38') 
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-07 12:25:23') 
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-08 09:32:07') 
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-08 09:46:33') 
INSERT INTO dbo.Job VALUES ('1016',NULL,'2009-07-08 10:09:08') 
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-09 10:45:04') 
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-09 11:31:23') 
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-09 13:10:06') 
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-09 15:04:06') 
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-09 17:32:16') 
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-09 19:51:28') 
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-10 15:09:42') 
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-10 16:15:31') 
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-10 21:55:43') 
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-11 08:49:03') 
INSERT INTO dbo.Job VALUES ('1022',NULL,'2009-07-11 16:47:21') 
INSERT INTO dbo.Job VALUES ('1026',NULL,'2009-07-11 18:23:16') 
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-11 19:49:31') 
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-12 11:57:26') 
INSERT INTO dbo.Job VALUES ('1003',NULL,'2009-07-13 08:32:20') 
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-13 09:31:32') 
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 09:52:54') 
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 11:22:31') 
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-14 11:54:14') 
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-14 15:17:08') 
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-15 13:27:08') 
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-15 14:10:56') 
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-15 15:20:50') 
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-15 15:39:18') 
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-15 16:06:17') 
INSERT INTO dbo.Job VALUES ('1017',NULL,'2009-07-16 11:52:08') 

SET NOCOUNT OFF 
GO 

Answers

1

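Rows where LineNumber is null are ignored. How should IsRepeat be handled in that case?

It works on the test data. Is it efficient enough for production?

Where a (LineNumber, CreatedOn) pair occurs more than once, one row is picked arbitrarily (the one with the lowest JobId).

Basic idea:

  1. Get all pairs of JobIds that are at least seven days apart, by LineNumber. (The join uses a strict comparison, so in effect more than seven days apart.)
  2. Count the rows from seven days after the left side up to and including the right side. (Cnt)
  3. Then, if we know JobId x is not a repeat, the next non-repeat is the right side of the pair with x on the left and Cnt = 1.
  4. Use a recursive CTE that starts with the first row for each LineNumber.
  5. The recursive element uses the pairs with counts to get the next row.
  6. Finally, update: set IsRepeat to 0 for the non-repeats and 1 for everything else.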

; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber 
      , LeftJobId 
      , RightJobId 
      , BeginCreatedOn 
      , EndCreatedOn) as 
     (select l.LineNumber 
      , l.JobId 
      , r.JobId 
      , dateadd(day, 7, l.CreatedOn) 
      , r.CreatedOn 
     from Job l 
     inner join Job r 
      on l.LineNumber = r.LineNumber 
      and dateadd(day, 7, l.CreatedOn) < r.CreatedOn 
      and l.JobId <> r.JobId) 
    -- Count the rows from BeginCreatedOn 
    -- up to and including EndCreatedOn. 
    -- When CreatedOn = EndCreatedOn, count only rows with 
    -- JobId <= RightJobId, to handle ties in CreatedOn. 
    , AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as 
     (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*) 
     from AllPairsByLineNumberAtLeast7DaysApart ap 
     inner join Job j 
      on j.LineNumber = ap.LineNumber 
      and ap.BeginCreatedOn <= j.createdOn 
      and (j.CreatedOn < ap.EndCreatedOn 
       or (j.CreatedOn = ap.EndCreatedOn 
        and j.JobId <= ap.RightJobId)) 
     group by ap.LineNumber, ap.LeftJobId, ap.RightJobId) 
    , Step1 (LineNumber, JobId, CreatedOn, RN) as 
     (select LineNumber, JobId, CreatedOn 
      , row_number() over 
       (partition by LineNumber order by CreatedOn, JobId) 
     from Job) 
    , Results (JobId, LineNumber, CreatedOn) as  
     -- Start with the first rows. 
     (select JobId, LineNumber, CreatedOn 
     from Step1 
     where RN = 1 
     and LineNumber is not null 
     -- get the next row 
     union all 
     select j.JobId, j.LineNumber, j.CreatedOn 
     from Results r 
     inner join AllPairsCount apc on apc.LeftJobId = r.JobId 
     inner join Job j 
      on j.JobId = apc.RightJobId 
      and apc.CNT = 1) 
    update j 
    set IsRepeat = case when R.JobId is not null then 0 else 1 end 
    from Job j 
    left outer join Results r 
     on j.JobId = R.JobId 
    where j.LineNumber is not null 
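
To see why Cnt = 1 picks out the next non-repeat, the pairing and counting steps can be replayed outside the database. The following is a rough Python sketch, not a translation of the query plan. It uses the LineNumber 1005 rows from the test script; the JobIds assume the identity column starts at 1 and follows the insert order, the timestamps are as written in the script (before any smalldatetime rounding), and the helper name `pairs_with_cnt_1` is invented here.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

# (JobId, CreatedOn) for LineNumber 1005; JobIds are an assumption (see above).
rows_1005 = [
    (4, datetime(2009, 7, 1, 10, 51, 10)),
    (5, datetime(2009, 7, 2, 9, 22, 30)),
    (23, datetime(2009, 7, 8, 9, 46, 33)),
    (27, datetime(2009, 7, 9, 13, 10, 6)),
    (34, datetime(2009, 7, 11, 8, 49, 3)),
    (40, datetime(2009, 7, 13, 9, 31, 32)),
    (45, datetime(2009, 7, 15, 13, 27, 8)),
]

def pairs_with_cnt_1(rows):
    """Map LeftJobId -> RightJobId for the pairs whose count is 1."""
    steps = {}
    for l_id, l_created in rows:
        begin = l_created + WINDOW            # BeginCreatedOn
        for r_id, r_created in rows:
            if not begin < r_created:         # pair must be > 7 days apart
                continue
            # Rows from begin up to and including the right row,
            # with JobId breaking ties on CreatedOn.
            cnt = sum(
                1
                for j_id, j_created in rows
                if begin <= j_created
                and (j_created < r_created
                     or (j_created == r_created and j_id <= r_id))
            )
            if cnt == 1:                      # right row is the first one
                steps[l_id] = r_id
    return steps

steps = pairs_with_cnt_1(rows_1005)

# Walk the chain from the first row, as the recursive CTE does.
chain = [rows_1005[0][0]]
while chain[-1] in steps:
    chain.append(steps[chain[-1]])
print(steps, chain)
```

Pairs whose left side is itself a repeat (JobIds 5 and 23 here) are computed as well, but the walk never reaches them because it only follows steps out of confirmed non-repeats.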

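Edit:

After I shut down my computer last night, I realized I had made things more complicated than they needed to be. Here is a simpler (and, on the test data, slightly more efficient) query.

Basic idea:

  1. Generate PotentialSteps (FromJobId, ToJobId). These are the pairs where, if FromJobId is not a repeat, then ToJobId is not a repeat either. (ToJobId is the first row with the same LineNumber that is more than seven days after FromJobId.)
  2. Use a recursive CTE that starts from the first JobId for each LineNumber and then steps, using PotentialSteps, to each non-repeat JobId.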

; with PotentialSteps (FromJobId, ToJobId) as 
    (select FromJobId, ToJobId 
    from (select f.JobId as FromJobId 
      , t.JobId as ToJobId 
      , row_number() over 
       -- one partition per From row, so each FromJobId gets its own first ToJobId 
       (partition by f.JobId order by t.CreatedOn, t.JobId) as RN 
     from Job f 
     inner join Job t 
      on f.LineNumber = t.LineNumber 
      and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t 
     where RN = 1) 
, NonRepeats (JobId) as 
    (select JobId 
    from (select JobId 
      , row_number() over 
       (partition by LineNumber order by CreatedOn, JobId) as RN 
     from Job) Start 
    where RN = 1 
    union all 
    select J.JobId 
    from NonRepeats NR 
    inner join PotentialSteps PS 
     on NR.JobId = PS.FromJobId 
    inner join Job J 
     on PS.ToJobId = J.JobId) 
update J 
set IsRepeat = case when NR.JobId is not null then 0 else 1 end 
from Job J 
left outer join NonRepeats NR 
on J.JobId = NR.JobId 
where J.LineNumber is not null 
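
The stepping idea behind this simpler query can also be cross-checked with a few lines of Python. This is a sketch under assumptions, not the author's method: the function name `non_repeats` is invented, the JobIds assume the identity column starts at 1 and follows the insert order of the test script, and only the rows for LineNumbers 1005 and 1029 (plus one NULL row) are included.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def non_repeats(rows):
    """rows: (job_id, line_number, created_on). Returns the non-repeat JobIds."""
    by_line = defaultdict(list)
    for job_id, line, created in rows:
        if line is not None:                  # NULL LineNumbers are ignored
            by_line[line].append((created, job_id))
    found = set()
    for entries in by_line.values():
        entries.sort()                        # ORDER BY CreatedOn, JobId
        times = [created for created, _ in entries]
        i = 0                                 # anchor: first row of the LineNumber
        while i < len(entries):
            found.add(entries[i][1])
            # Potential step: first row strictly more than 7 days later.
            i = bisect_right(times, times[i] + WINDOW)
    return found

rows = [
    (4, "1005", datetime(2009, 7, 1, 10, 51, 10)),
    (5, "1005", datetime(2009, 7, 2, 9, 22, 30)),
    (7, None, datetime(2009, 7, 2, 11, 15, 33)),
    (8, "1029", datetime(2009, 7, 2, 13, 1, 13)),
    (10, "1029", datetime(2009, 7, 3, 13, 57, 34)),
    (14, "1029", datetime(2009, 7, 5, 15, 8, 35)),
    (21, "1029", datetime(2009, 7, 7, 12, 25, 23)),
    (23, "1005", datetime(2009, 7, 8, 9, 46, 33)),
    (27, "1005", datetime(2009, 7, 9, 13, 10, 6)),
    (34, "1005", datetime(2009, 7, 11, 8, 49, 3)),
    (38, "1029", datetime(2009, 7, 12, 11, 57, 26)),
    (40, "1005", datetime(2009, 7, 13, 9, 31, 32)),
    (45, "1005", datetime(2009, 7, 15, 13, 27, 8)),
]
result = non_repeats(rows)
print(sorted(result))
```

Every JobId with a non-null LineNumber that is missing from the returned set would get IsRepeat = 1, matching the final update.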
+0

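Wow! I really have to get to grips with CTEs! Examples like this really fire my synapses as I get to use them. Looking forward to putting this through its paces. :) – BlackMael 2009-08-18 11:54:02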

+0

It also produces an interesting execution plan... for those sad enough to be interested. Damn, I guess I'm sad, because I have already looked. – BlackMael 2009-08-18 11:55:39

+0

It ignores LineNumber IS NULL, but that's OK. In case I do need to care, I'll leave IsRepeat as NULL. For the most part I think I would only need to default it to FALSE when LineNumber is NULL. – BlackMael 2009-08-18 11:57:39

-2

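I'm not proud of this. It makes many assumptions (for example, that CreatedOn is a date only and that (LineNumber, CreatedOn) is a key), may need some tweaking, and only works with the test data.

In other words, I wrote this more out of intellectual curiosity than because I think it is a real solution. The final select could be turned into an update that sets IsRepeat in the base table depending on whether a matching row exists in V4. One last note before letting people see the evil: if anyone can post comments with data sets for which it doesn't work, it might yet turn into a real solution: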

with V1 as (
select t1.LineNumber,t1.CreatedOn,t2.CreatedOn as PrevDate from 
T1 t1 inner join T1 t2 on t1.LineNumber = t2.LineNumber and t1.CreatedOn > t2.CreatedOn and DATEDIFF(DAY,t2.CreatedOn,t1.CreatedOn) < 7 
), V2 as (
select v1.LineNumber,v1.CreatedOn,V1.PrevDate from V1 
union all 
select v1.LineNumber,v1.CreatedOn,v2.PrevDate from v1 inner join v2 on V1.LineNumber = v2.LineNumber and v1.PrevDate = v2.CreatedOn 
), V3 as (
select LineNumber,CreatedOn,MIN(PrevDate) as PrevDate from V2 group by LineNumber,CreatedOn 
), V4 as (
select LineNumber,CreatedOn from V3 where DATEDIFF(DAY,PrevDate,CreatedOn) < 7 
) 
select 
    T1.LineNumber, 
    T1.CreatedOn, 
    CASE WHEN V4.LineNumber is Null then 0 else 1 end as IsRepeat 
from 
    T1 
     left join 
    V4 
     on 
      T1.LineNumber = V4.LineNumber and 
      T1.CreatedOn = V4.CreatedOn 
order by T1.CreatedOn,T1.LineNumber 
option (maxrecursion 7) 
+0

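LineNumber is not part of the primary key. CreatedOn has a time component; it is basically SMALLDATETIME. I'll post some data for you soon. I don't know how your query works yet, but on the very limited data I have cooked up it seems to work :) – BlackMael 2009-08-17 18:48:31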

+0

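-1. This returns every row for a LineNumber from 7 days after the first one onwards as a non-repeat. It doesn't handle repeats that come after the second non-repeat row. Look at LineNumber 1005 in the test data the OP posted. Only the rows created at 2009-07-01 10:51:00 and 2009-07-09 13:10:00 should have IsRepeat = False. You have all rows >= 2009-07-08 09:47:00. (Of course, 07-08 comes from working in whole days, but you shouldn't get every date since then as a non-repeat, unless I have misunderstood the OP.) – 2009-08-18 06:52:24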

+0

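This was posted before BlackMael's update, when the only test data was the 6-row table at the top of the post. On that data, it returned the correct result set. – 2009-08-18 06:55:17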

-1
UPDATE Jobs 
SET Jobs.IsRepeat = 0 -- mark all of them IsRepeat = false 

UPDATE Jobs 
SET Jobs.IsRepeat = 1 
WHERE EXISTS 
    (SELECT TOP 1 i.LineNumber FROM Jobs i WHERE i.LineNumber = Jobs.LineNumber 
    AND i.CreatedOn <> Jobs.CreatedOn and i.CreatedOn BETWEEN Jobs.CreatedOn - 7 
    AND Jobs.CreatedOn) 

Note: I hope this helps. Let me know if you find any discrepancies when you run it against a larger data set.

+0

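Sorry, this doesn't take into account that a job is not a repeat if the only other jobs with the same line number in the previous 7 days are themselves repeats. – BlackMael 2009-08-18 11:36:37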

+0

@BlackMael: Can you give an example? – shahkalpesh 2009-08-18 16:09:23
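
One example can be drawn from the question's own test data. The sketch below is not part of the original thread; it is a Python illustration that compares the EXISTS-style check with the chained rule for LineNumber 1005. The function names `naive` and `chained` are invented, the timestamps are as written in the test script, and the seven-day boundary is assumed to be inclusive for both checks.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

# CreatedOn values for LineNumber 1005, in ascending order.
times_1005 = [
    datetime(2009, 7, 1, 10, 51, 10),
    datetime(2009, 7, 2, 9, 22, 30),
    datetime(2009, 7, 8, 9, 46, 33),
    datetime(2009, 7, 9, 13, 10, 6),
    datetime(2009, 7, 11, 8, 49, 3),
    datetime(2009, 7, 13, 9, 31, 32),
    datetime(2009, 7, 15, 13, 27, 8),
]

def naive(times):
    """Repeat if ANY other record falls in [t - 7 days, t] (the EXISTS check)."""
    return [any(o != t and t - WINDOW <= o <= t for o in times) for t in times]

def chained(times):
    """Repeat only if within 7 days of the latest NON-repeat record."""
    flags, anchor = [], None
    for t in times:
        is_repeat = anchor is not None and t <= anchor + WINDOW
        if not is_repeat:
            anchor = t                        # only non-repeats move the anchor
        flags.append(is_repeat)
    return flags

naive_flags = naive(times_1005)
chained_flags = chained(times_1005)
differences = [t for t, n, c in zip(times_1005, naive_flags, chained_flags)
               if n != c]
print(differences)
```

The two checks disagree on the 2009-07-09 13:10 record: its only neighbour in the previous seven days is the 2009-07-08 record, which is itself a repeat of the 2009-07-01 one, so under the chained rule the 07-09 record starts a new window.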