2014-02-25 17 views
3

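Efficiently stretching a time series in SQL

I want to efficiently stretch a time series to a different length using SQL. Suppose I have the following data: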

SQLFiddle (PostgreSQL)

-- drop table if exists time_series; 

create table time_series (
    id serial, 
    val numeric) 
; 

insert into time_series (val) values 
    (1), (2), (3), (4), (5), (6), 
    (5), (4), (3), (2), (1); 

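This time series has length 11, and I want to stretch it to length 15 in such a way that the sum of the values in the stretched time series equals the sum of the values in the original time series. I have a solution, but it is inefficient: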

select 
    new_id, 
    sum(new_val) as new_val 
from 
    (
    select 
     id, 
     val/15.0 as new_val, 
     ceil(row_number() over(order by id, gs)/11.0) as new_id 
    from 
     time_series 
     cross join (select generate_series(1, 15) gs) gs 
) raw_data 
group by 
    new_id 
order by 
    new_id 
; 

This first creates a table of 15 * 11 rows and then collapses it back into 15 rows.
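
To make the intermediate step concrete, here is a minimal sketch of the same idea on hypothetical toy data (the `toy` table and its two values are assumptions, not part of the question). Two values are stretched into three rows, so each value is split into 3 pieces and every 2 consecutive pieces form one new row.

```sql
-- Sketch only: "toy" is hypothetical illustration data, not the question's table.
with toy (id, val) as (values (1, 6.0), (2, 12.0))
select
    id,
    gs,
    val / 3.0 as piece_val,                                     -- each value is split into 3 equal pieces
    ceil(row_number() over (order by id, gs) / 2.0) as new_id   -- every 2 consecutive pieces share one new_id
from
    toy
    cross join generate_series(1, 3) gs
order by
    id, gs;
```

The six pieces fall into new rows 1, 2 and 3 with sums 4, 6 and 8, which add up to the original total of 18.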

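While this works for small time series, performance degrades noticeably for longer ones. If I want to stretch 2,000 rows to 3,000, the query first has to generate 6M rows (which takes 30 seconds on my laptop).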

Test data:

insert into time_series (val) select generate_series(1, 1000); 
insert into time_series (val) select generate_series(1000, 1, -1); 

Is there a more efficient solution in SQL that gives the same result?

Answers

0

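I figured it out. To stretch a time series with 5 elements into a time series with 30 elements while preserving the sum of the values, you can use: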

with time_series (id, val) as (values 
    (1, 1), 
    (2, 2), 
    (3, 3), 
    (4, 2), 
    (5, 1) 
) 

, mapping_to_old_ts_ids as (
    select 
    gs as new_id, 
    case when mod(((gs - 1) * otsl + 1), ntsl) <> 0 then ((gs - 1) * otsl + 1)/ntsl + 1 else ((gs - 1) * otsl + 1)/ntsl end as old_id_start, 
    case mod(((gs - 1) * otsl + 1), ntsl) when 0 then ntsl else mod(((gs - 1) * otsl + 1), ntsl) end as old_id_start_piece, 
    case when mod((gs * otsl), ntsl) <> 0 then (gs * otsl)/ntsl + 1 else (gs * otsl)/ntsl end as old_id_end, 
    case mod((gs * otsl), ntsl) when 0 then ntsl else mod((gs * otsl), ntsl) end as old_id_end_piece, 
    ntsl 
    from 
    (select generate_series(1, ntsl) as gs, ntsl from (select 30 as ntsl) a) new_time_series 
    cross join (select count(*) as otsl from time_series) old_time_series_length  
) 

select 
    new_id, 
    case 
     when old_id_start = old_id_end then (old_id_end_piece - old_id_start_piece + 1)/ntsl::numeric * ts1.val 
     when old_id_start <> old_id_end then (ntsl::numeric - old_id_start_piece +1)/ntsl::numeric * ts1.val + coalesce((old_id_end_piece/ntsl::numeric * ts2.val), 0) end as new_val 
from 
    mapping_to_old_ts_ids oid 
    join time_series ts1 on (oid.old_id_start = ts1.id) 
    left join time_series ts2 on (oid.old_id_end = ts2.id) 
order by 
    new_id 

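The query above is a simplified version of my original, more verbose query. If you are interested, this is how I worked out the solution step by step (stretching 5 rows into 8):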

with time_series (id, val) as (values 
    (1, 1), 
    (2, 2), 
    (3, 3), 
    (4, 2), 
    (5, 1) 
) 

/* The basic idea is to divide every element into 8 pieces and then aggregate it 
    back by 5 elements. When trying to stretch 5 into 8, we will have 5 * 8 = 40 
    elements. For every element in new time series we can calculate what is the id 
    of first and last piece. */  
, piece_start_end as (
    select 
    gs as new_id, 
    (gs - 1) * 5 + 1 as piece_start, 
    gs * 5 as piece_end 
    from 
    generate_series(1, 8) gs 
) 


/* Now we need to calculate where exactly in the old time series we have the beginning 
and end of pieces. E.g. 1st element of new time series starts in element 1 at position 1 
and ends in element 1 at position 5. 2nd element of new time series starts in element 1 
at position 6 and ends in element 2 at position 2. */ 
, mapping_to_old_ts_ids as (
    select 
    *, 
    case when mod(piece_start, 8) <> 0 then piece_start/8 + 1 else piece_start/8 end as old_id_start, 
    case mod(piece_start, 8) when 0 then 8 else mod(piece_start, 8) end as old_id_start_piece, 

    case when mod(piece_end, 8) <> 0 then piece_end/8 + 1 else piece_end/8 end as old_id_end, 
    case mod(piece_end, 8) when 0 then 8 else mod(piece_end, 8) end as old_id_end_piece 
    from 
    piece_start_end 
) 

/* In final step we just need to assign final value to new time series by taking 
appropriate number of pieces from old time series elements. */ 


select 
    new_id, 

    old_id_start, 
    old_id_start_piece, 
    ts1.val as old_id_start_val, 

    old_id_end, 
    old_id_end_piece, 
    ts2.val as old_id_end_val, 

    case 
     when old_id_start = old_id_end then (old_id_end_piece - old_id_start_piece + 1)/8.0 * ts1.val 
     when old_id_start <> old_id_end then (8 - old_id_start_piece +1)/8.0 * ts1.val + coalesce((old_id_end_piece/8.0 * ts2.val), 0) end as new_val 

from 
    mapping_to_old_ts_ids oid 
    join time_series ts1 on (oid.old_id_start = ts1.id) 
    left join time_series ts2 on (oid.old_id_end = ts2.id) 
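
As a possible simplification (a sketch, not part of the original answer), the CASE expressions in the mapping step amount to a ceiling division and a 1-based modulo. Both can also be written with plain integer arithmetic by shifting to 0-based piece numbers first. The lengths 5 and 8 are hard-coded here, as in the step-by-step query, and the column names are reused from it.

```sql
-- Sketch only: same mapping as mapping_to_old_ts_ids, without CASE expressions.
-- Relies on PostgreSQL integer division truncating toward zero for non-negative integers.
select
    gs as new_id,
    ((gs - 1) * 5) / 8 + 1     as old_id_start,        -- old element holding the first piece
    mod((gs - 1) * 5, 8) + 1   as old_id_start_piece,  -- position of that piece inside the element (1..8)
    (gs * 5 - 1) / 8 + 1       as old_id_end,          -- old element holding the last piece
    mod(gs * 5 - 1, 8) + 1     as old_id_end_piece     -- position of that piece inside the element (1..8)
from
    generate_series(1, 8) gs
order by
    new_id;
```

For new_id = 2 this gives a start in element 1 at position 6 and an end in element 2 at position 2, which matches the example in the comment of the step-by-step query.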
1

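Try this query, which does not use a cross join.

First we build the ts1 subquery, which holds the value intervals (each value paired with the next one), and then join it to the new sequence. In the select list, the new ID is linearly interpolated within the joined value interval to give new_val.

Also, in this query we use +1 and -1 to convert the 1, 2, 3, ... sequence to 0, 1, 2, ....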

select 
    gs as new_id, 
    Sval+(Eval-SVal)*((gs.gs-1) /(100.0/(11.0-1))+1-ts1.ID) as new_val, 
    SVal as StartInterval, 
    EVal as EndInterval  
from 
    (Select generate_series(1, 100) gs) gs 
    left join 
    (select T1.ID, T1.Val SVal,T2.Val EVal 
    FROM 
    time_series T1 
    JOIN time_series T2 ON T1.Id=T2.ID-1) ts1 
    ON floor((gs.gs-1) /(100.0/(11.0-1)))+1=ts1.ID 
order by 
gs 
+0

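This is definitely fast, but it does something different. I need the sum of 'new_val' after stretching to be the same as the sum of 'val' in the original time series (I should have mentioned this in my answer). Where my query returns '1000999.99', yours returns '1502249.50'. Since SQLFiddle is down, I have added test data for a time series with 2,000 rows to my question.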
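
The mismatch described in this comment can be reproduced on a small scale. The sketch below applies the same interpolation formula to a hypothetical three-row series and compares the two sums. The `toy` table, the target length 5, and the CTE and column names are assumptions for illustration only.

```sql
-- Sketch only: "toy" is hypothetical data; 3 is the old length, 5 the new length.
with toy (id, val) as (values (1, 1.0), (2, 2.0), (3, 3.0))
, intervals as (
    -- pair each value with the next one, as in the ts1 subquery
    select t1.id, t1.val as start_val, t2.val as end_val
    from toy t1
    join toy t2 on t1.id = t2.id - 1
)
, interpolated as (
    select
        gs as new_id,
        start_val + (end_val - start_val) * ((gs - 1) / (5.0 / (3.0 - 1)) + 1 - i.id) as new_val
    from
        generate_series(1, 5) gs
        left join intervals i on floor((gs - 1) / (5.0 / (3.0 - 1))) + 1 = i.id
)
select
    (select sum(val) from toy)              as original_sum,
    (select sum(new_val) from interpolated) as interpolated_sum;
```

On this toy data the original sum is 6 while the interpolated sum is 9: interpolation samples the series at more points instead of redistributing the existing values, so the total grows with the new length.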