2014-07-25 111 views
6

我正在尋找一種有效的方法來查找時間戳集合之間的所有交集。它需要使用PostgreSQL 9.2。在PostgreSQL中查找所有組範圍的所有交集

假設範圍代表一個人可以見面的時間。每個人在可用時可能有一個或多個時間範圍。我想查找全部會議可以發生的時間段(即所有人都可用)。

這是我到目前爲止。它似乎有效,但我認爲它效率不高,因爲它一次考慮一個人的可用性。

WITH RECURSIVE td AS 
(
    -- Test data. Returns: 
    -- ["2014-01-20 00:00:00","2014-01-31 00:00:00") 
    -- ["2014-02-01 00:00:00","2014-02-20 00:00:00") 
    -- ["2014-04-15 00:00:00","2014-04-20 00:00:00") 
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time 
    UNION SELECT 1, '2014-02-01', '2014-02-28' 
    UNION SELECT 1, '2014-04-01', '2014-04-30' 
    UNION SELECT 2, '2014-01-15', '2014-02-20' 
    UNION SELECT 2, '2014-04-15', '2014-05-05' 
    UNION SELECT 3, '2014-01-20', '2014-04-20' 
) 
, ranges AS 
(
    -- Convert to tsrange type 
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range 
    FROM td 
) 
, min_max AS 
(
    SELECT MIN(entity_id), MAX(entity_id) 
    FROM td 
) 
, inter AS 
(
    -- Ranges for the lowest ID 
    SELECT entity_id AS last_id, the_range 
    FROM ranges r 
    WHERE r.entity_id = (SELECT min FROM min_max) 

    UNION ALL 

    -- Iteratively intersect with ranges for the next higher ID 
    SELECT entity_id, r.the_range * i.the_range 
    FROM ranges r 
    JOIN inter i ON r.the_range && i.the_range 
    WHERE r.entity_id > i.last_id 
     AND NOT EXISTS 
     (
      SELECT * 
      FROM ranges r2 
      WHERE r2.entity_id < r.entity_id AND r2.entity_id > i.last_id 
     ) 
) 
-- Take the final set of intersections 
SELECT * 
FROM inter 
WHERE last_id = (SELECT max FROM min_max) 
ORDER BY the_range; 
+3

無關:提供 「靜態數據」 可以在不使用'select'和'使用'值union'做短'子句:'values(1,'2014-01-01':: timestamp,'2014-01-31':: timestamp),(2,...)'並在CTE中定義列名稱。 –

回答

7

我創建了tsrange_interception_agg

create function tsrange_interception (
    internal_state tsrange, next_data_values tsrange 
) returns tsrange as $$ 
    select internal_state * next_data_values; 
$$ language sql; 

create aggregate tsrange_interception_agg (tsrange) (
    sfunc = tsrange_interception, 
    stype = tsrange, 
    initcond = $$[-infinity, infinity]$$ 
); 

那麼這個查詢

with td (id, begin_time, end_time) as 
(
    values 
    (1, '2014-01-01'::timestamp, '2014-01-31'::timestamp), 
    (1, '2014-02-01', '2014-02-28'), 
    (1, '2014-04-01', '2014-04-30'), 
    (2, '2014-01-15', '2014-02-20'), 
    (2, '2014-04-15', '2014-05-05'), 
    (3, '2014-01-20', '2014-04-20') 
), ranges as (
    select 
     id, 
     row_number() over(partition by id) as rn, 
     tsrange(begin_time, end_time) as tr 
    from td 
), cr as (
    select r0.tr tr0, r1.tr as tr1 
    from ranges r0 cross join ranges r1 
    where 
     r0.id < r1.id and 
     r0.tr && r1.tr and 
     r0.id = (select min(id) from td) 
) 
select tr0 * tsrange_interception_agg(tr1) as interseptions 
from cr 
group by tr0 
having count(*) = (select count(distinct id) from td) - 1 
; 
       interseptions     
----------------------------------------------- 
["2014-02-01 00:00:00","2014-02-20 00:00:00") 
["2014-01-20 00:00:00","2014-01-31 00:00:00") 
["2014-04-15 00:00:00","2014-04-20 00:00:00") 
+0

謝謝!我用這個總體思想來解決我原來的問題,這個問題比這個更復雜,所以我將其標記爲已接受。 (我不能把它簡單地歸結爲一組tsranges,但仍然使用一個集合,只是一個更復雜的集合。) – EM0

0

OK,我寫在TSQL測試這一點,但它應該運行或者至少是足夠接近你翻譯回來,這一切都相當香草結構。 除了可能之間,但可以分解爲<子句和一個>子句。(感謝@Horse)

WITH cteSched AS (--Schedule for everyone 
    -- Test data. Returns: 
    -- ["2014-01-20 00:00:00","2014-01-31 00:00:00") 
    -- ["2014-02-01 00:00:00","2014-02-20 00:00:00") 
    -- ["2014-04-15 00:00:00","2014-04-20 00:00:00") 
    SELECT 1 AS entity_id, '2014-01-01' AS begin_time, '2014-01-31' AS end_time 
    UNION SELECT 1, '2014-02-01', '2014-02-28' 
    UNION SELECT 1, '2014-04-01', '2014-04-30' 
    UNION SELECT 2, '2014-01-15', '2014-02-20' 
    UNION SELECT 2, '2014-04-15', '2014-05-05' 
    UNION SELECT 3, '2014-01-20', '2014-04-20' 
), cteReq as ( --List of people to schedule (or is everyone in Sched required? Not clear, doesn't hurt) 
    SELECT 1 as entity_id UNION SELECT 2 UNION SELECT 3 
), cteBegins as (
    SELECT distinct begin_time FROM cteSched as T 
    WHERE NOT EXISTS (SELECT entity_id FROM cteReq as R 
         WHERE NOT EXISTS (SELECT * FROM cteSched as X 
             WHERE X.entity_id = R.entity_id 
              AND T.begin_time BETWEEN X.begin_time AND X.end_time)) 
) SELECT B.begin_time, MIN(S.end_time) as end_time 
    FROM cteBegins as B cross join cteSched as S 
    WHERE B.begin_time between S.begin_time and S.end_time 
    GROUP BY B.begin_time 
-- NOTE: This assume users do not have schedules that overlap with themselves! That is, nothing like 
-- John is available 2014-01-01 to 2014-01-15 and 2014-01-10 to 2014-01-20. 

編輯:
BEGIN_TIME END_TIME(當SQL-2008R2服務器上執行)從上方添加輸出
2014年1月20日2014-01-31
2014-02- 01 2014年2月20日
2014年4月15日2014年4月20日

+0

'between'是每個DBMS支持的標準SQL。 –

+0

對不起,我無法得到這個工作(在我的原始數據),但無論如何感謝您的替代方法。 – EM0

+0

@ E-M什麼都不起作用?你有錯誤嗎?輸出是不同的? –

1

如果您想要交叉參考實體的固定數量,你可以使用一個交叉連接爲他們每個人,並建立交叉點(在範圍上使用*運算符)。

儘管如此,使用這種交叉連接可能效率較低。下面的例子更多地解釋了下面更復雜的例子。

WITH td AS 
(
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time 
    UNION SELECT 1, '2014-02-01', '2014-02-28' 
    UNION SELECT 1, '2014-04-01', '2014-04-30' 
    UNION SELECT 2, '2014-01-15', '2014-02-20' 
    UNION SELECT 2, '2014-04-15', '2014-05-05' 
    UNION SELECT 4, '2014-01-20', '2014-04-20' 
) 
,ranges AS 
(
    -- Convert to tsrange type 
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range 
    FROM td 
) 
SELECT r1.the_range * r2.the_range * r3.the_range AS r 
FROM ranges r1 
CROSS JOIN ranges r2 
CROSS JOIN ranges r3 
WHERE r1.entity_id=1 AND r2.entity_id=2 AND r3.entity_id=4 
    AND NOT isempty(r1.the_range * r2.the_range * r3.the_range) 
ORDER BY r 

在這種情況下,多個交叉連接可能是效率較低,因爲你實際上並不需要有在現實中每個範圍內的所有可能的組合,因爲isempty(r1.the_range * r2.the_range)是足以讓isempty(r1.the_range * r2.the_range * r3.the_range)真。

我不認爲你可以避免每個人的可用性,因爲你希望他們都能見面。

通過將每個人的可用性與使用另一個遞歸CTE(以下示例中的intersections)計算出的先前子集進行交叉連接,可以幫助您逐漸構建一組交叉點。然後,逐步建立交叉,擺脫空範圍,包括存儲陣列:

WITH RECURSIVE td AS 
(
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time 
    UNION SELECT 1, '2014-02-01', '2014-02-28' 
    UNION SELECT 1, '2014-04-01', '2014-04-30' 
    UNION SELECT 2, '2014-01-15', '2014-02-20' 
    UNION SELECT 2, '2014-04-15', '2014-05-05' 
    UNION SELECT 4, '2014-01-20', '2014-04-20' 
) 
,ranges AS 
(
    -- Convert to tsrange type 
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range 
    FROM td 
) 
,ranges_arrays AS (
    -- Prepare an array of all possible intervals per entity 
    SELECT entity_id, array_agg(the_range) AS ranges_arr 
    FROM ranges 
     GROUP BY entity_id 
) 
,numbered_ranges_arrays AS (
    -- We'll join using pos+1 next, so we want continuous integers 
    -- I've changed the example entity_id from 3 to 4 to demonstrate this. 
    SELECT ROW_NUMBER() OVER() AS pos, entity_id, ranges_arr 
    FROM ranges_arrays 
) 
,intersections (pos, subranges) AS (
    -- We start off with the infinite range. 
    SELECT 0::bigint, ARRAY['[,)'::tsrange] 
    UNION ALL 
    -- Then, we unnest the previous intermediate result, 
    -- cross join it against the array of ranges from the 
    -- next row in numbered_ranges_arrays (joined via pos+1). 
    -- We take the intersection and remove the empty array. 
    SELECT r.pos, 
      ARRAY(SELECT x * y FROM unnest(r.ranges_arr) x CROSS JOIN unnest(i.subranges) y WHERE NOT isempty(x * y)) 
    FROM numbered_ranges_arrays r 
     INNER JOIN intersections i ON r.pos=i.pos+1 
) 
,last_intersections AS (
    -- We just really want the result from the last operation (with the max pos). 
    SELECT subranges FROM intersections ORDER BY pos DESC LIMIT 1 
) 
SELECT unnest(subranges) r FROM last_intersections ORDER BY r 

我不知道這是否可能有更好的表現,很遺憾。您可能需要更大的數據集才能獲得有意義的基準。

+0

謝謝,這工作,雖然性能類似於我原來的查詢。總體解決方案結束時速度更快。 – EM0

+0

@EM很高興知道。出於好奇,如果你有一個數據集來測試這附近,你可以試着看看在我的查詢中'CROSS JOIN'中的'WHERE NOT isempty(x * y)'改成'WHERE x && y'是否改善了性能? – Bruno