Fastest way to fill a table with data from another table

I have been reading about the concepts behind functional programming, and it made me reconsider the way I do things.

For example, there is a table:

- Client, Date, Trial, Full 
- Client1, 14.11.2012, 1, 1 
- Client1, 06.02.2013, NULL, 1 
- Client1, 27.03.2013, NULL, 1 
- Client1, 15.05.2013, NULL, 1

The table contains millions of records for 500,000 clients. My goal is to transform this data into something like client states:

- Client, Date, Status 
- Client1, 14.11.2012, 'Mixed' 
- Client1, 01.12.2012, 'Unprocessed' 
- Client1, 01.01.2013, 'Unprocessed' 
- Client1, 13.01.2013, 'Slept' 
- Client1, 01.02.2013, 'Slept' 
- Client1, 06.02.2013, 'Processed' 
- Client1, 01.03.2013, 'Unprocessed' 
- Client1, 27.03.2013, 'Processed' 
- Client1, 01.04.2013, 'Unprocessed' 
- Client1, 01.05.2013, 'Unprocessed' 
- Client1, 15.05.2013, 'Processed' 
- Client1, 01.06.2013, 'Unprocessed' 
- Client1, 01.07.2013, 'Unprocessed' 
- Client1, 23.07.2013, 'Slept' 
- Client1, 01.08.2013, 'Slept' 
- Client1, 01.09.2013, 'Slept' 
- Client1, 01.10.2013, 'Slept' 
- Client1, 01.11.2013, 'Slept' 
- Client1, 01.12.2013, 'Slept' 
- Client1, 01.01.2014, 'Slept' 
- Client1, 10.01.2014, 'Left'

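A short version of the transformation algorithm:

- If it is the first row and Trial = 1 and Full = 1, then Status = 'Mixed'
- If it is the first day of a month and there is no data for it, then Status = 'Unprocessed'
- If 60 days have passed with no record containing Full = 1, then Status = 'Slept'
- If 240 days have passed with no record containing Full = 1, then Status = 'Left'
- If it is the first day of a month and the previous Status = 'Slept', then Status = 'Slept'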
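As a rough illustration of the rules above in a general-purpose language, here is a minimal Python sketch. The function name `classify` and its parameters are hypothetical. The 'Processed' branch is not in the rule list; it is inferred from the sample output, where every later row with Full = 1 is marked 'Processed'. Cases the question skips (for example, no earlier Full record at all) raise an error instead of guessing.

```python
from datetime import date
from typing import Optional


def classify(day: date,
             is_first_row: bool,
             trial: Optional[int],
             full: Optional[int],
             last_full: Optional[date]) -> str:
    """Return the status for one output row (hypothetical helper).

    last_full is the date of the most recent earlier record with Full = 1.
    """
    if is_first_row and trial == 1 and full == 1:
        return "Mixed"
    if full == 1:
        # Assumption: inferred from the sample output, not from the rule list.
        return "Processed"
    if last_full is None:
        # The question skips these cases, so the sketch does not guess.
        raise ValueError("no earlier Full record; case not covered")
    gap = (day - last_full).days
    if gap >= 240:
        return "Left"
    if gap >= 60:
        return "Slept"
    return "Unprocessed"
```

With the sample data, 13.01.2013 is exactly 60 days after 14.11.2012 and comes out as 'Slept', and 10.01.2014 is exactly 240 days after 15.05.2013 and comes out as 'Left'.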

I have skipped many cases here, because the question is not about the algorithm but about the tool.

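To transform the data within SQL I use the following expressions:

- ROW_NUMBER() OVER (PARTITION BY [Client] ORDER BY [Date] ASC)
- LAG([Date], 1) OVER (PARTITION BY [Client] ORDER BY [Date] DESC)
- DATEADD(day, 1, EOMONTH([Date]))
- recursion
- etc.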
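For comparison with a general-purpose language, the expressions listed above can be approximated in plain Python as sketched below. The helper names `first_of_next_month` and `number_and_next` are hypothetical. Note that LAG over a descending date order returns the next later date for each row, so the sketch pairs each row with the following date rather than the preceding one.

```python
from datetime import date, timedelta
from itertools import groupby
from operator import itemgetter


def first_of_next_month(d: date) -> date:
    """Same result as DATEADD(day, 1, EOMONTH(d))."""
    # Day 28 exists in every month; adding 4 days always lands in the next month.
    return (d.replace(day=28) + timedelta(days=4)).replace(day=1)


def number_and_next(rows):
    """rows: iterable of (client, date) tuples.

    Returns (client, date, row_number, next_date) per client in date order,
    mirroring ROW_NUMBER() and LAG(... ORDER BY [Date] DESC).
    """
    result = []
    for client, group in groupby(sorted(rows), key=itemgetter(0)):
        dates = [d for _, d in group]
        following = dates[1:] + [None]  # the last row has no later date
        for n, (d, nxt) in enumerate(zip(dates, following), start=1):
            result.append((client, d, n, nxt))
    return result
```

This only shows that the building blocks have direct equivalents; it says nothing about which approach is faster on millions of rows.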

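I have a feeling that this cannot be the fastest way to transform the data. Working in multiple steps (handling each client as a separate step) might also be very useful, but I do not know how good SQL is at that. After adding a large number of extra cases, my execution plan has become huge.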

So my question is: what tool is best for transforming this kind of data? Could a programming language handle it better?

Update: I have prepared the requested SQL code. Feel free to point out any problems: http://pastebin.com/3nCdfquG


2016-09-22 user1464922

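Can you post the SQL you already have? How does it perform once the data is indexed? This doesn't sound like it should be too hard for SQL to do efficiently. – iamdave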

As long as one can understand what you are actually trying to accomplish, SQL will do any transformation and insert just fine. – ajeh

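I just wanted to know whether my approach is well thought out. I will post the code later; maybe SQL really is enough and my code is simply badly designed. – user1464922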

The following assumes you will be processing one client at a time, for example for a client report.

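I took the dataset you provided, loaded it into a table called ClientData and applied the following index. It may be slightly overkill, as it essentially creates a copy of the data, but it makes things fast: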

create nonclustered index ix_ClientData_Client_Date 
on dbo.ClientData(Client,Date) 
include (Trial,[Full])

I then created a table of dates for the given Client ID, running from their first Date value to the lesser of their most recent Date value + 240 days or today.

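From this table you can filter out all the dates that are of no use. Join the resulting dataset to itself to get the previous ClientData row and process your Status logic.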
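The date-table step described above can also be sketched outside SQL. The following Python function is a hypothetical illustration, not the answer's actual method: `candidate_dates`, `record_dates`, `today` and `horizon_days` are names introduced here. It walks day by day from the client's first date to the lesser of (last date + 240 days) and today, and keeps only month starts, dates that have a record, and the end date.

```python
from datetime import date, timedelta


def candidate_dates(record_dates, today, horizon_days=240):
    """Return the dates worth keeping for one client (hypothetical helper)."""
    first = min(record_dates)
    # End of the window: last record + horizon, capped at today.
    last = min(max(record_dates) + timedelta(days=horizon_days), today)
    known = set(record_dates)
    kept = []
    day = first
    while day <= last:
        # Keep month starts, dates with client data, and the final date.
        if day.day == 1 or day in known or day == last:
            kept.append(day)
        day += timedelta(days=1)
    return kept
```

For example, with records on 14.11.2012 and 06.02.2013 and `today` set to 10.03.2013, the kept dates are 14.11.2012, 01.12.2012, 01.01.2013, 01.02.2013, 06.02.2013, 01.03.2013 and 10.03.2013.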

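As you haven't included your full set of logic flows, I have done what I can and left in ERROR messages in case you start changing things. I find these useful for seeing where and why a case statement isn't quite doing what I want it to: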

if object_id('tempdb..#ClientJourney') is not null 
drop table #ClientJourney 

declare @Client nvarchar(50) = '0x802B52540027E50211E24949C409C617' 

declare @MinDate date = (select min(Date) 
         from ClientData 
         where Client = @Client 
         ) 
declare @MaxDate date = (select case when dateadd(d,240,max(Date)) > getdate() 
            then getdate() 
            else dateadd(d,240,max(Date)) 
            end 
         from ClientData 
         where Client = @Client 
         ) 

--select max(Date), @MinDate,@MaxDate, datediff(d,max(date),@MaxDate) from ClientData where Client = @Client 


-- Create a table of dates between @MinDate and @MaxDate with a recursive cte 
;with Dates as 
(
select @MinDate as DateValue 
     ,case when datepart(day,@MinDate) = 1 then 1 else 0 end as MonthStart 

union all 

select dateadd(d,1,DateValue) 
     ,case when datepart(day,dateadd(d,1,DateValue)) = 1 then 1 else 0 end as MonthStart 
from Dates 
where DateValue < @MaxDate 
) 
-- Then exclude any that aren't either the first of the month, in the ClientData table or the @MaxDate value 
select row_number() over (order by DateValue) as RowNum 
     ,d.DateValue 
     ,d.MonthStart 
     ,c.Trial 
     ,c.[Full] 
into #ClientJourney 
from Dates d 
    left join ClientData c 
     on(d.DateValue = c.Date 
      and c.Client = @Client 
      ) 
where d.MonthStart = 1 
    or c.Date is not null 
    or d.DateValue = @MaxDate 
option (maxrecursion 0) 


-- Pull that data out, joining to itself to get the previous item of ClientData and then process the Status 
select j.RowNum 
     ,j.DateValue 
     ,j.MonthStart 
     ,j.Trial 
     ,j.[Full] 

     -- Handling of first line in dataset 
     ,case when j.RowNum = 1 
      then case when j.Trial is not null 
          and j.[Full] is not null 
         then 'Mixed' 
        when j.Trial is null 
          and j.[Full] is not null 
         then 'Full' 
        when j.Trial is not null 
          and j.[Full] is null 
         then 'Trial' 
        else 'ERROR1' 
        end 

      -- Handling rest of dataset 
      else case when j.MonthStart = 1             -- First of the month 
         then case when j.Trial is not null          -- WITH client data 
             or j.[Full] is not null 
            then 'Processed' 

           when j.Trial is null           -- WITHOUT client data 
             and j.[Full] is null 
            then case when datediff(d,jp.DateValue,j.DateValue) < 60  -- For less than 60 days 
                then 'Unprocessed' 
               when datediff(d,jp.DateValue,j.DateValue) < 240  -- For less than 240 days 
                then 'Slept' 
               else 'Left' 
               end 
           else 'ERROR2' 
           end 
        else                 -- Rest of the Month 
         case when j.[Full] = 1             -- WITH Full flag 
            then 'Processed' 

           when j.[Full] is null           -- WITHOUT Full flag 
            then case when datediff(d,jp.DateValue,j.DateValue) < 60   -- For less than 60 days 
                then 'Unprocessed' 
               when datediff(d,jp.DateValue,j.DateValue) < 240   -- For less than 240 days 
                then 'Slept' 
               else 'Left' 
               end 
           else 'ERROR3' 
           end 

        end 
      end as Status 
     ,jp.DateValue as LastFullDate 
     ,datediff(d,jp.DateValue,j.DateValue) as LastFull 
from #ClientJourney j 
    outer apply (select top 1 DateValue    -- Returns only the most recent earlier row that has a Full value 
        from #ClientJourney j2 
        where j.RowNum > j2.RowNum 
         and j2.[Full] is not null 
        order by DateValue desc 
       ) jp 


-- Clean up 
if object_id('tempdb..#ClientJourney') is not null 
drop table #ClientJourney


2016-09-23 18:03:36 iamdave

Thanks. I get your idea. – user1464922
