2016-09-22 67 views
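Fastest way to fill a table with data from another table

I have been reading about the concepts behind functional programming, and it made me reconsider the way I do things.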

For example, there is a table:

- Client, Date, Trial, Full 
- Client1, 14.11.2012, 1, 1 
- Client1, 06.02.2013, NULL, 1 
- Client1, 27.03.2013, NULL, 1 
- Client1, 15.05.2013, NULL, 1 

The table contains millions of records for 500,000 clients. My goal is to transform this data into client states like these:

- Client, Date, Status 
- Client1, 14.11.2012, 'Mixed' 
- Client1, 01.12.2012, 'Unprocessed' 
- Client1, 01.01.2013, 'Unprocessed' 
- Client1, 13.01.2013, 'Slept' 
- Client1, 01.02.2013, 'Slept' 
- Client1, 06.02.2013, 'Processed' 
- Client1, 01.03.2013, 'Unprocessed' 
- Client1, 27.03.2013, 'Processed' 
- Client1, 01.04.2013, 'Unprocessed' 
- Client1, 01.05.2013, 'Unprocessed' 
- Client1, 15.05.2013, 'Processed' 
- Client1, 01.06.2013, 'Unprocessed' 
- Client1, 01.07.2013, 'Unprocessed' 
- Client1, 23.07.2013, 'Slept' 
- Client1, 01.08.2013, 'Slept' 
- Client1, 01.09.2013, 'Slept' 
- Client1, 01.10.2013, 'Slept' 
- Client1, 01.11.2013, 'Slept' 
- Client1, 01.12.2013, 'Slept' 
- Client1, 01.01.2014, 'Slept' 
- Client1, 10.01.2014, 'Left' 

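A short version of the transformation algorithm is:

  1. If it is the first row and Trial = 1 and Full = 1, then status = 'Mixed'
  2. If it is the first day of a month and there is no data, then status = 'Unprocessed'
  3. If 60 days have passed without any record containing Full = 1, then status = 'Slept'
  4. If 240 days have passed without any record containing Full = 1, then status = 'Left'
  5. If it is the first day of a month and the previous status = 'Slept', then status = 'Slept'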
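
As a quick sanity check of the 60-day and 240-day thresholds in rules 3 and 4, the dates on which 'Slept' and 'Left' first appear in the sample output can be reproduced with DATEADD. This is only a sketch: the variable names are illustrative and do not come from the original code.

```sql
-- Illustrative only: both dates are taken from the sample data above.
declare @FirstRecord date = '20121114';  -- first row (Trial = 1, Full = 1)
declare @LastFull    date = '20130515';  -- last row with Full = 1

select dateadd(day, 60,  @FirstRecord) as SleptFrom   -- 2013-01-13
      ,dateadd(day, 240, @LastFull)    as LeftFrom;   -- 2014-01-10
```

Both results match rows in the target table: 13.01.2013 is the first 'Slept' row and 10.01.2014 is the 'Left' row.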

I have skipped a lot of cases here, because the question is about the tool, not the algorithm.

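To transform the data within SQL I use the following expressions:

  • ROW_NUMBER() OVER (PARTITION BY [Client] ORDER BY [Date] ASC)
  • LAG([Date], 1) OVER (PARTITION BY [Client] ORDER BY [Date] DESC)
  • DATEADD(DAY, 1, EOMONTH([Date]))
  • Recursion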
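
A minimal sketch of how these expressions behave on the four sample rows, assuming SQL Server 2012 or later (LAG and EOMONTH are not available before that). The table variable and the column aliases are hypothetical. Note that LAG here is ordered ascending so that it returns the previous date; ordered descending, as in the list above, it returns the next date instead.

```sql
-- Hypothetical table variable holding the sample rows from the question.
declare @ClientData table (Client nvarchar(50), [Date] date, Trial int, [Full] int);

insert into @ClientData (Client, [Date], Trial, [Full])
values ('Client1', '20121114', 1,    1),
       ('Client1', '20130206', null, 1),
       ('Client1', '20130327', null, 1),
       ('Client1', '20130515', null, 1);

select Client
      ,[Date]
      -- Position of the row within its client, oldest first
      ,row_number() over (partition by Client order by [Date] asc) as RowNum
      -- Date of the previous row for the same client (null on the first row)
      ,lag([Date], 1) over (partition by Client order by [Date] asc) as PreviousDate
      -- First day of the month following [Date]
      ,dateadd(day, 1, eomonth([Date])) as NextMonthStart
from @ClientData
order by Client, [Date];
```

For the first row, NextMonthStart is 2012-12-01, which is the first month-start row in the target table.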

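I have a feeling that this cannot be the fastest way to transform the data, and that working in multiple steps (handling each client separately) could be very useful, though I don't know how good SQL is at that. After adding a large number of extra cases, my execution plan has become enormous.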

So my question is: which tool is best for transforming this kind of data? Could a programming language handle it better?

Update: I have prepared the requested SQL code. Feel free to point out any problems: http://pastebin.com/3nCdfquG

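Can you post the SQL you already have? How does it perform with the data indexed? This doesn't sound like something that should be too hard for SQL to do efficiently. – iamdave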

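As long as it is possible to understand what you are actually trying to accomplish, SQL will handle any transformation and insert just fine. – ajeh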

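I just wanted to know whether my approach was well thought out. I will post the code later; maybe SQL really is enough and my code is simply badly designed. – user1464922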

Answer

The following assumes you will be processing one client at a time, for example for a client report.


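I took the dataset you provided and loaded it into a table called ClientData. Applying the following index may be slight overkill, as it essentially creates a copy of the data, but it makes things very fast: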

create nonclustered index ix_CientData_Client_Date 
on dbo.ClientData(Client,Date) 
include (Trial,[Full]) 

I then created a table of dates for the given Client ID, running from their first Date value to the lesser of their most recent Date value + 240 days or today.

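From this table you can filter out all the dates that are of no use. Join that dataset to itself to get the previous ClientData row and process your Status logic.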

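As you haven't included your full logic flow, I have done what I can and left ERROR values in place in case you start changing things. I find this really useful for seeing why a case statement isn't quite doing what I expect it to: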

if object_id('tempdb..#ClientJourney') is not null 
drop table #ClientJourney 

declare @Client nvarchar(50) = '0x802B52540027E50211E24949C409C617' 

declare @MinDate date = (select min(Date) 
         from ClientData 
         where Client = @Client 
         ) 
declare @MaxDate date = (select case when dateadd(d,240,max(Date)) > getdate() 
            then getdate() 
            else dateadd(d,240,max(Date)) 
            end 
         from ClientData 
         where Client = @Client 
         ) 

--select max(Date), @MinDate,@MaxDate, datediff(d,max(date),@MaxDate) from ClientData where Client = @Client 


-- Create a table of dates between @MinDate and @MaxDate with a recursive cte 
;with Dates as 
(
select @MinDate as DateValue 
     ,case when datepart(day,@MinDate) = 1 then 1 else 0 end as MonthStart 

union all 

select dateadd(d,1,DateValue) 
     ,case when datepart(day,dateadd(d,1,DateValue)) = 1 then 1 else 0 end as MonthStart 
from Dates 
where DateValue < @MaxDate 
) 
-- Then exclude any that aren't either the first of the month, in the ClientData table or the @MaxDate value 
select row_number() over (order by DateValue) as RowNum 
     ,d.DateValue 
     ,d.MonthStart 
     ,c.Trial 
     ,c.[Full] 
into #ClientJourney 
from Dates d 
    left join ClientData c 
     on(d.DateValue = c.Date 
      and c.Client = @Client 
      ) 
where d.MonthStart = 1 
    or c.Date is not null 
    or d.DateValue = @MaxDate 
option (maxrecursion 0) 


-- Pull that data out, joining to itself to get the previous item of ClientData and then process the Status 
select j.RowNum 
     ,j.DateValue 
     ,j.MonthStart 
     ,j.Trial 
     ,j.[Full] 

     -- Handling of first line in dataset 
     ,case when j.RowNum = 1 
      then case when j.Trial is not null 
          and j.[Full] is not null 
         then 'Mixed' 
        when j.Trial is null 
          and j.[Full] is not null 
         then 'Full' 
        when j.Trial is not null 
          and j.[Full] is null 
         then 'Trial' 
        else 'ERROR1' 
        end 

      -- Handling rest of dataset 
      else case when j.MonthStart = 1             -- First of the month 
         then case when j.Trial is not null          -- WITH client data 
             or j.[Full] is not null 
            then 'Processed' 

           when j.Trial is null           -- WITHOUT client data 
             and j.[Full] is null 
            then case when datediff(d,jp.DateValue,j.DateValue) < 60  -- For less than 60 days 
                then 'Unprocessed' 
               when datediff(d,jp.DateValue,j.DateValue) < 240  -- For less than 240 days 
                then 'Slept' 
               else 'Left' 
               end 
           else 'ERROR2' 
           end 
        else                 -- Rest of the Month 
         case when j.[Full] = 1             -- WITH Full flag 
            then 'Processed' 

           when j.[Full] is null           -- WITHOUT Full flag 
            then case when datediff(d,jp.DateValue,j.DateValue) < 60   -- For less than 60 days 
                then 'Unprocessed' 
               when datediff(d,jp.DateValue,j.DateValue) < 240   -- For less than 240 days 
                then 'Slept' 
               else 'Left' 
               end 
           else 'ERROR3' 
           end 

        end 
      end as Status 
     ,jp.DateValue 
     ,datediff(d,jp.DateValue,j.DateValue) as LastFull 
from #ClientJourney j 
    outer apply (select top 1 DateValue    -- Returns only the most recent earlier row that has a Full value
        from #ClientJourney j2 
        where j.RowNum > j2.RowNum 
         and j2.[Full] is not null 
        order by DateValue desc 
       ) jp 


-- Clean up 
if object_id('tempdb..#ClientJourney') is not null 
drop table #ClientJourney 
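
The script above handles one client at a time. As a hedged sketch of how the "most recent earlier Full row" lookup could be written for many clients in one pass, a windowed MAX with a frame ending one row before the current row can stand in for the outer apply. This assumes SQL Server 2012 or later; the @Journey table variable, its Client column, and the LastFullDate alias are hypothetical and only mimic the shape of #ClientJourney.

```sql
-- Hypothetical rows shaped like #ClientJourney, with a Client column added.
declare @Journey table (Client nvarchar(50), DateValue date, [Full] int);

insert into @Journey (Client, DateValue, [Full])
values ('Client1', '20121114', 1),
       ('Client1', '20121201', null),
       ('Client1', '20130101', null),
       ('Client1', '20130201', null),
       ('Client1', '20130206', 1),
       ('Client1', '20130301', null);

with PrevFull as
(
select Client
      ,DateValue
      ,[Full]
      -- Latest earlier date, per client, on which Full was set
      ,max(case when [Full] is not null then DateValue end)
         over (partition by Client
               order by DateValue
               rows between unbounded preceding and 1 preceding) as LastFullDate
from @Journey
)
select Client
      ,DateValue
      ,[Full]
      ,LastFullDate
      ,datediff(d, LastFullDate, DateValue) as LastFull   -- days since that date
from PrevFull
order by Client, DateValue;
```

Whether this is faster than the outer apply on the real data would need to be checked against the actual execution plans; the sketch only shows that the lookup itself does not require a self-join.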

Thanks. I get your idea. – user1464922
