Filter only value changes in a Postgres table

I have a rather odd problem. I have a table of 44 million records that looks like this:

SKU | Timestamp   | Status 
A | 21-09-2016 12:30:00 | 1 
B | 21-09-2016 12:30:00 | 1 
C | 21-09-2016 12:30:00 | 1 
D | 21-09-2016 12:30:00 | 1 
A | 21-09-2016 12:39:00 | 0 
B | 21-09-2016 12:40:00 | 0 
C | 21-09-2016 12:40:00 | 0 
D | 21-09-2016 12:45:00 | 0 
A | 21-09-2016 12:52:00 | 1 
A | 21-09-2016 12:56:00 | 1 
A | 21-09-2016 12:58:00 | 1 
B | 21-09-2016 12:59:00 | 1 
A | 21-09-2016 21:30:00 | 0

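The requirement is that we should consider only the records where the status changed. For example, in the table above, SKU A has status 1 starting at 21-09-2016 12:30:00. We then look at the later records to see when the status changes: it becomes 0 at 21-09-2016 12:39:00, goes back to 1 at 21-09-2016 12:52:00, and the next change after that is seen at 21-09-2016 21:30:00. So we need a table with the following output: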

SKU | Timestamp   | Status 
A | 21-09-2016 12:30:00 | 1 
A | 21-09-2016 12:39:00 | 0 
A | 21-09-2016 12:52:00 | 1 
A | 21-09-2016 21:30:00 | 0 
B | 21-09-2016 12:30:00 | 1 
B | 21-09-2016 12:40:00 | 0 
B | 21-09-2016 12:59:00 | 1 
C | 21-09-2016 12:30:00 | 1 
C | 21-09-2016 12:40:00 | 0 
D | 21-09-2016 12:30:00 | 1 
D | 21-09-2016 12:45:00 | 0

2016-12-23 Saurabh Omar

I think you want lag():

select t.* 
from (select t.*, 
      lag(status) over (partition by sku order by timestamp) as prev_status 
     from t 
    ) t 
where (prev_status is distinct from status) ;

Note: is distinct from is much like <>, but it handles NULL values more intuitively.
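
To illustrate the note above, here is a small sketch of how the two comparisons behave when one side is NULL. This matters here because lag() returns NULL for the first row of each SKU, so that row only passes the filter when is distinct from is used.

```sql
-- With <>, a comparison against NULL yields NULL, so a WHERE clause drops the row.
select null::int <> 1;                      -- NULL
-- IS DISTINCT FROM treats NULL as an ordinary, comparable value.
select null::int is distinct from 1;        -- true
select 1 is distinct from 1;                -- false
select null::int is distinct from null;     -- false
```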

2016-12-23 12:20:41

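No need for lead. –

Hi Gordon, thanks for your answer. How long should we expect this to take on a 44-million-record table? –

@SaurabhOmar . . . An index on '(sku, timestamp, status)' should help speed up the query. –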
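
A minimal sketch of the index suggested in the comment above, assuming the table is named t as in the answer. The index name is made up for illustration, and whether the planner actually uses the index depends on the data and the server configuration.

```sql
-- Hypothetical index name; the column list follows the comment above.
-- "timestamp" is quoted only to make clear that it is a column name here.
create index t_sku_ts_status_idx on t (sku, "timestamp", status);

-- Refresh planner statistics after building the index.
analyze t;
```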

select sku, timestamp, status 
from (
    select *, lag(status) over (partition by sku order by timestamp) as prev_status 
    from example 
    ) s 
where prev_status is distinct from status;

Test it here.
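
To try the query locally, a setup along the following lines should work. It is only a sketch: the table name example comes from the query above, while the column types and the ISO-style timestamp literals are my assumptions. The rows are the sample data from the question.

```sql
-- Assumed column types; the question does not state them.
create table example (sku text, "timestamp" timestamp, status int);

insert into example (sku, "timestamp", status) values
    ('A', '2016-09-21 12:30:00', 1), ('B', '2016-09-21 12:30:00', 1),
    ('C', '2016-09-21 12:30:00', 1), ('D', '2016-09-21 12:30:00', 1),
    ('A', '2016-09-21 12:39:00', 0), ('B', '2016-09-21 12:40:00', 0),
    ('C', '2016-09-21 12:40:00', 0), ('D', '2016-09-21 12:45:00', 0),
    ('A', '2016-09-21 12:52:00', 1), ('A', '2016-09-21 12:56:00', 1),
    ('A', '2016-09-21 12:58:00', 1), ('B', '2016-09-21 12:59:00', 1),
    ('A', '2016-09-21 21:30:00', 0);

-- Same query as above, with an explicit ORDER BY so the output order is stable.
select sku, "timestamp", status
from (
    select *, lag(status) over (partition by sku order by "timestamp") as prev_status
    from example
    ) s
where prev_status is distinct from status
order by sku, "timestamp";
```

On this data the query keeps 11 of the 13 rows, dropping only the two repeated status 1 rows for SKU A at 12:56:00 and 12:58:00.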

2016-12-23 12:32:46 klin

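In addition to klin's and Gordon's answers, in reply to

How long should we expect this to take on a 44-million-record table

It depends heavily on the memory available to PostgreSQL, because the result of the subquery has to be stored somewhere (and then scanned again).

If there is enough RAM to hold the intermediate result, everything is fine; if not, you are in trouble.

For example, in my test on a table with 10,000,000 rows, I waited more than 15 minutes and then cancelled the plain query.

Alternatively, using a stored function, it completed in about 4 minutes, which is not much more than a simple sorted select (about 2 minutes).

Here is my test: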
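
One way to check whether memory is the bottleneck is sketched below. It assumes the relevant limit is the per-operation work_mem setting and uses the test table from this answer; the 256MB value is purely illustrative, not a recommendation.

```sql
-- Current per-operation memory limit for sorts and hashes.
show work_mem;

-- Raise it for this session only (illustrative value).
set work_mem = '256MB';

-- If the plan contains a Sort node, its "Sort Method" line shows whether the
-- sort ran in memory (e.g. quicksort) or spilled to disk (external merge).
explain (analyse, buffers)
select i, sku, ts, status
from (
    select *, lag(status) over (partition by sku order by ts) as prev_status
    from test
    ) s
where prev_status is distinct from status;

-- Restore the server default for the session.
reset work_mem;
```

If the planner reads the rows through an index on (sku, ts) instead of sorting, no Sort node appears and this setting has less influence.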

-- Create data 

--drop function if exists foo(); 
--drop table if exists test; 
create table test (i bigserial primary key, sku char(1), ts timestamp, status smallint); 

insert into test (sku, ts, status) 
    select 
    chr(ascii('A') + (random()*3)::int), 
    now()::date + ((random()*100)::int || ' minutes')::interval, 
    (random()::int) 
    from generate_series(1,10000000); 

create index idx on test(sku, ts); 

analyse test; 

-- And function 

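create or replace function foo() returns setof test language plpgsql as $$ 
declare 
    r test;  -- current row 
    p test;  -- previous row (NULL before the first iteration) 
begin 
    -- Walk the rows in (sku, ts) order and emit a row only when its status 
    -- differs from the previous row's, or when a new sku starts. 
    for r in select * from test order by sku, ts loop 
        if p.status is distinct from r.status or p.sku is distinct from r.sku then 
            return next r; 
        end if; 
        p := r; 
    end loop; 
    return; 
end $$; 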

-- Test queries 

explain (analyse, verbose) 
select i, sku, ts, status 
from (
    select *, lag(status) over (partition by sku order by ts) as prev_status 
    from test 
    ) s 
where prev_status is distinct from status; 
-- Not completed, still working after ~ 15 min 

explain analyse select * from test order by sku, ts; 
-- Complete in ~2 min 

explain (analyse, verbose) select * from foo(); 
-- Complete in ~3:30 min

2016-12-23 18:22:39 Abelisto

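I don't know what kind of hardware you have. The select statement with the window function takes only 10 seconds on my desktop: https://explain.depesz.com/s/VrVU The "simple" select takes only 5 seconds: https://explain.depesz.com/s/puiO The function takes 16 seconds: https://explain.depesz.com/s/N8B9 –

When I generate 50 million rows (rather than the 10 million in your example), I get your runtimes. The window function query takes 2.5 minutes: https://explain.depesz.com/s/TGq The plain select takes about the same: https://explain.depesz.com/s/D4kl And the function call takes about 3 minutes: https://explain.depesz.com/s/d8Yv –

@a_horse_with_no_name So try 100 million rows, and maybe you will get something different :) As for my hardware, it is a ~6-year-old notebook, and I think the difference is also due to its slow hard drive. – Abelisto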