2016-04-03 60 views
2

Get envelope, i.e. overlapping time spans

I have a table of online sessions like this (the blank lines are only there for readability):

ip_address | start_time  | stop_time 
------------|------------------|------------------ 
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:12 
10.10.10.10 | 2016-04-02 08:11 | 2016-04-02 08:20 

10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:10 
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:08 
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:11 
10.10.10.10 | 2016-04-02 09:02 | 2016-04-02 09:15 
10.10.10.10 | 2016-04-02 09:10 | 2016-04-02 09:12 

10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07 
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11 

And I need the "enveloping" online time spans:

ip_address | full_start_time | full_stop_time 
------------|------------------|------------------ 
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:20 
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:15 
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11 

I have this query, which returns the desired result:

WITH t AS 
    -- Determine full time-range of each IP 
    (SELECT ip_address, MIN(start_time) AS min_start_time, MAX(stop_time) AS max_stop_time FROM IP_SESSIONS GROUP BY ip_address), 
t2 AS 
    -- compose ticks 
    (SELECT DISTINCT ip_address, min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE AS ts 
    FROM t 
    CONNECT BY min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE <= max_stop_time), 
t3 AS 
    -- get all "online" ticks 
    (SELECT DISTINCT ip_address, ts 
    FROM t2 
     JOIN IP_SESSIONS USING (ip_address) 
    WHERE ts BETWEEN start_time AND stop_time), 
t4 AS 
    (SELECT ip_address, ts, 
     LAG(ts) OVER (PARTITION BY ip_address ORDER BY ts) AS previous_ts 
    FROM t3), 
t5 AS 
    (SELECT ip_address, ts, 
     SUM(DECODE(previous_ts,NULL,1,0 + (CASE WHEN previous_ts + INTERVAL '1' MINUTE <> ts THEN 1 ELSE 0 END))) 
      OVER (PARTITION BY ip_address ORDER BY ts ROWS UNBOUNDED PRECEDING) session_no 
    FROM t4) 
SELECT ip_address, MIN(ts) AS full_start_time, MAX(ts) AS full_stop_time 
FROM t5 
GROUP BY ip_address, session_no 
ORDER BY 1,2; 

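However, I am concerned about performance. The table has several million rows, and the time resolution is milliseconds (not one minute as in the example), so CTE t3 is going to be huge. Does anybody have a solution that avoids the self-join and the CONNECT BY?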

A single smart analytic function would be great.

Answers

3

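Try this one, too. I tested it as well as I could, and I believe it covers all the possibilities, including coalescing adjacent intervals (10:15 to 10:30 and 10:30 to 10:40 are combined into a single interval, 10:15 to 10:40). It should also be quite fast, since it doesn't use much.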

with m as 
     (
     select ip_address, start_time, 
        max(stop_time) over (partition by ip_address order by start_time 
          rows between unbounded preceding and 1 preceding) as m_time 
     from ip_sessions 
     union all 
     select ip_address, NULL, max(stop_time) from ip_sessions group by ip_address 
     ), 
    n as 
     (
     select ip_address, start_time, m_time 
     from m 
     where start_time > m_time or start_time is null or m_time is null 
     ), 
    f as 
     (
     select ip_address, start_time, 
      lead(m_time) over (partition by ip_address order by start_time) as stop_time 
     from n 
     ) 
select * from f where start_time is not null 
/
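
For readers who want to sanity-check results outside the database, the running-maximum idea behind the query above can be sketched in plain Python. This is an illustration rather than a port of the query: the function name `merge_sessions` and the `SAMPLE` list are made up here, and timestamps are kept as strings that happen to sort chronologically.

```python
from collections import defaultdict

# Sample rows from the question: (ip_address, start_time, stop_time).
SAMPLE = [
    ("10.10.10.10", "2016-04-02 08:00", "2016-04-02 08:12"),
    ("10.10.10.10", "2016-04-02 08:11", "2016-04-02 08:20"),
    ("10.10.10.10", "2016-04-02 09:00", "2016-04-02 09:10"),
    ("10.10.10.10", "2016-04-02 09:05", "2016-04-02 09:08"),
    ("10.10.10.10", "2016-04-02 09:05", "2016-04-02 09:11"),
    ("10.10.10.10", "2016-04-02 09:02", "2016-04-02 09:15"),
    ("10.10.10.10", "2016-04-02 09:10", "2016-04-02 09:12"),
    ("10.66.44.22", "2016-04-02 08:05", "2016-04-02 08:07"),
    ("10.66.44.22", "2016-04-02 08:03", "2016-04-02 08:11"),
]


def merge_sessions(rows):
    """Return one (ip, start, stop) envelope per run of overlapping or touching sessions."""
    by_ip = defaultdict(list)
    for ip, start, stop in rows:
        by_ip[ip].append((start, stop))

    envelopes = []
    for ip in sorted(by_ip):
        cur_start = cur_stop = None
        # Walk the sessions in start order, tracking the largest stop seen so far.
        for start, stop in sorted(by_ip[ip]):
            if cur_start is None:
                cur_start, cur_stop = start, stop
            elif start > cur_stop:
                # Gap: this start lies strictly after every earlier stop.
                envelopes.append((ip, cur_start, cur_stop))
                cur_start, cur_stop = start, stop
            else:
                # Overlapping or adjacent: extend the current envelope.
                cur_stop = max(cur_stop, stop)
        if cur_start is not None:
            envelopes.append((ip, cur_start, cur_stop))
    return envelopes


for row in merge_sessions(SAMPLE):
    print(row)
```

The strict `start > cur_stop` comparison mirrors the `start_time > m_time` filter in the query, which is why touching intervals end up in the same envelope.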
+0

Nice solution; I don't see any problems with it either. –

+1

@WernfriedDomscheit - If you still care about this kind of problem, I found that Stew Ashton has a better solution on his blog. It should be about twice as fast as mine. https://stewashton.wordpress.com/2015/06/08/merging-overlapping-date-ranges/ – mathguy

+0

Great approach. Yes, it should be faster because it does not contain the UNION ALL. I will test it. –

0

I think that using lag() and a cumulative sum will perform better:

select ip_address, min(start_time) as full_start_time, 
     max(stop_time) as full_stop_time 
from (select t.*, 
      sum(case when prev_et >= start_time then 0 else 1 end) over 
       (partition by ip_address order by start_time) as grp 
     from (select s.*, 
        lag(stop_time) over (partition by ip_address order by stop_time) as prev_et 
      from ip_sessions s) t 
     ) 
group by grp, ip_address 
order by 1, 2; 

This gives the result:

ip_address | full_start_time | full_stop_time 
------------|------------------|------------------ 
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 09:15 
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:12 
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11 
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07 
+0

This does not work. IP 10.10.10.10 was offline from 08:20:01 to 08:59:59, and IP 10.66.44.22 was online from 08:03 to 08:11. (I edited the result of your query into your answer.) –

1

Please test this solution. It works for your example, but there may be cases I did not notice. No CONNECT BY, no self-join.

with io as (
    select * from (
    select ip_address, t1, io, sum(io) over (partition by ip_address order by t1) sio 
     from (
     select ip_address, start_time t1, 1 io from ip_sessions 
     union all 
     select ip_address, stop_time, -1 io from ip_sessions)) 
    where (io = 1 and sio = 1) or (io = -1 and sio = 0)) 
select ip_address, t1, t2 
    from (
    select io.*, lead(t1) over (partition by ip_address order by t1) as t2 from io) 
    where io = 1 

Test data:

create table ip_sessions (ip_address varchar2(15), start_time date, stop_time date); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:00:00', timestamp '2016-04-02 08:12:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:11:00', timestamp '2016-04-02 08:20:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:00:00', timestamp '2016-04-02 09:10:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:05:00', timestamp '2016-04-02 09:08:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:02:00', timestamp '2016-04-02 09:15:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:10:00', timestamp '2016-04-02 09:12:00'); 
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:05:00', timestamp '2016-04-02 08:07:00'); 
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:03:00', timestamp '2016-04-02 08:11:00'); 

Output:

IP_ADDRESS  T1                  T2 
----------- ------------------- ------------------- 
10.10.10.10 2016-04-02 08:00:00 2016-04-02 08:20:00 
10.10.10.10 2016-04-02 09:00:00 2016-04-02 09:15:00 
10.66.44.22 2016-04-02 08:03:00 2016-04-02 08:11:00 
+0

This does not work if you also insert a row like this: INSERT INTO IP_SESSIONS VALUES ('10.10.10.10', TIMESTAMP '2016-04-02 09:00:00', TIMESTAMP '2016-04-02 09:16:00'); –

+0

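...because in that case we have two sessions starting at 9:00. Change 'union all' to 'union' in the third line (this may reduce performance), or add 'rows between unbounded preceding and current row' to the window. –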

+0

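UNION instead of UNION ALL will not work: if there are two intervals, one from 9:00 to 9:12 and one from 9:00 to 9:15, you will pick the shorter one and miss the 9:12 to 9:15 stretch. Suggestion: try not to use the same name (io) for both tables and columns. One more point: this solution may fail to merge 9:00 to 9:12 with 9:12 to 9:18; presumably the result should be 9:00 to 9:18. Possible fix: in the definition of sio, in the over clause, change the ordering to 'order by t1, io desc'. – mathguy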
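
The two fixes suggested in these comments (an explicit ROWS frame and the 'io desc' tie-break) can be combined into one query. The sketch below runs it from Python against an in-memory SQLite database as a stand-in for Oracle, which is an assumption of this example and needs SQLite 3.25 or later for window functions. The names `events`, `running`, `edges`, `delta`, `open_cnt` and the helper `envelopes` are invented for the illustration, and timestamps are stored as text.

```python
import sqlite3

QUERY = """
with events as (
    -- +1 when a session opens, -1 when it closes
    select ip_address, start_time as ts, 1 as delta from ip_sessions
    union all
    select ip_address, stop_time, -1 from ip_sessions
),
running as (
    -- starts sort before stops at the same instant, so touching sessions merge;
    -- the ROWS frame counts tied rows one by one instead of as a peer group
    select ip_address, ts, delta,
           sum(delta) over (partition by ip_address
                            order by ts, delta desc
                            rows between unbounded preceding and current row) as open_cnt
    from events
),
edges as (
    -- keep only the events that open (0 -> 1) or close (1 -> 0) an envelope
    select ip_address, ts, delta
    from running
    where (delta = 1 and open_cnt = 1) or (delta = -1 and open_cnt = 0)
)
select ip_address, full_start_time, full_stop_time
from (
    select ip_address, delta, ts as full_start_time,
           lead(ts) over (partition by ip_address order by ts, delta desc) as full_stop_time
    from edges
)
where delta = 1
order by ip_address, full_start_time
"""


def envelopes(rows):
    """Load (ip, start, stop) rows into a scratch table and return the merged spans."""
    con = sqlite3.connect(":memory:")
    con.execute("create table ip_sessions (ip_address text, start_time text, stop_time text)")
    con.executemany("insert into ip_sessions values (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    demo = [
        ("10.10.10.10", "2016-04-02 09:00:00", "2016-04-02 09:12:00"),
        ("10.10.10.10", "2016-04-02 09:00:00", "2016-04-02 09:16:00"),
        ("10.10.10.10", "2016-04-02 09:16:00", "2016-04-02 09:18:00"),
    ]
    print(envelopes(demo))
```

The SQL text uses only constructs that Oracle also supports, so it should carry over to DATE or TIMESTAMP columns, but it has not been run against Oracle here.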

0

In the end I settled on a function that meets my requirements. I think it works the same way as Ponder Stibbons' answer.

CREATE OR REPLACE TYPE SESSION_REC AS OBJECT (START_TIME TIMESTAMP_UNCONSTRAINED, STOP_TIME TIMESTAMP_UNCONSTRAINED); 
CREATE OR REPLACE TYPE SESSION_TYPE AS TABLE OF SESSION_REC; 
CREATE OR REPLACE TYPE TIMESTAMP_TAB AS TABLE OF TIMESTAMP_UNCONSTRAINED; 

CREATE OR REPLACE FUNCTION ENVELOP_SESSIONS(v_ipaddress IN VARCHAR2) 
    RETURN SESSION_TYPE PIPELINED IS 

    rec SESSION_REC; 
    startTimes TIMESTAMP_TAB; 
    stopTimes TIMESTAMP_TAB; 

    TYPE ActionRecType IS RECORD (TS TIMESTAMP_UNCONSTRAINED, ACTION INTEGER); 
    TYPE ActionTableType IS TABLE OF ActionRecType; 
    actions ActionTableType; 
    onlineCount INTEGER := 0; 

BEGIN 

    SELECT START_TIME, STOP_TIME 
    BULK COLLECT INTO startTimes, stopTimes 
    FROM IP_SESSIONS 
    WHERE IP_ADDRESS = v_ipaddress; 

    WITH t AS 
     (SELECT COLUMN_VALUE AS ts, 1 AS action 
     FROM TABLE(startTimes) 
     UNION ALL 
     SELECT COLUMN_VALUE AS ts, -1 AS action 
     FROM TABLE(stopTimes)) 
    SELECT ts, action 
    BULK COLLECT INTO actions 
    FROM t 
    ORDER BY ts, action; 

    IF actions.COUNT > 0 THEN 
     FOR i IN actions.FIRST..actions.LAST LOOP  
      IF onlineCount = 0 AND actions(i).ACTION = 1 THEN 
       -- session starts 
       rec := SESSION_REC(actions(i).TS, NULL); 
      ELSIF onlineCount = 1 AND actions(i).ACTION = -1 THEN 
       -- session ends 
       rec := SESSION_REC(rec.START_TIME, actions(i).TS); 
       PIPE ROW(rec); 
      END IF; 
      onlineCount := onlineCount + actions(i).ACTION; 
     END LOOP;  
    END IF; 
    RETURN;  

END ENVELOP_SESSIONS;