Update query is too slow on Postgres 9.1

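My problem is that I have a very slow update query on a table with 14 million rows. I tried different things to tune my server, which gave good performance in general, but not for update queries.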

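I have two tables:

T1 has 4 columns and 3 indexes on it (530 rows)
T2 has 15 columns and 3 indexes on it (14 million rows)
I want to update the field vid (type integer) in T2 with the matching vid value from T1, joining the two tables on the text field stxt (stxt1 in T1, stxt2 in T2).

Here is my query and its output: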

explain analyse 
update T2 
    set vid=T1.vid 
from T1 
where stxt2 ~ stxt1 and T2.vid = 0;

 
Update on T2 (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=25141785.741..25141785.741 rows=0 loops=1) 
-> Nested Loop (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=32.636..25035782.995 rows=679354 loops=1) 
      Join Filter: ((T2.stxt2)::text ~ (T1.stxt1)::text) 
      -> Seq Scan on T2 (cost=0.00..594772.96 rows=1061980 width=121) (actual time=0.067..5402.614 rows=1037809 loops=1) 
         Filter: (vid= 1) 
      -> Materialize (cost=0.00..17.95 rows=530 width=34) (actual time=0.000..0.069 rows=530 loops=1037809) 
         -> Seq Scan on T1 (cost=0.00..15.30 rows=530 width=34) (actual time=0.019..0.397 rows=530 loops=1) 
Total runtime: 25141785.904 ms

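As you can see, the query took about 25141 seconds (about 7 hours). If I understand correctly, the planner estimated the execution time at 9037 seconds (~2.5 hours). Am I missing something here?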
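A side note on reading the plan: the cost figures printed by EXPLAIN are in arbitrary planner units (by default scaled so that one sequential page fetch costs 1.0), not milliseconds, so 9037530.59 is not a time prediction. Only the "actual time" and "Total runtime" values are measured in milliseconds. A minimal sketch of the arithmetic and of the settings the cost units are tied to:

```sql
-- Actual runtime reported by EXPLAIN ANALYZE, converted from milliseconds to hours.
SELECT round(25141785.904 / 1000 / 3600, 2) AS actual_hours;  -- 6.98

-- Planner costs are expressed relative to settings such as these, not in seconds.
SHOW seq_page_cost;
SHOW cpu_operator_cost;
```

The row estimate is also worth comparing: the nested loop was estimated at 2814247 rows but actually produced 679354.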

Here is my server configuration:

CentOS 5.8, 20GB RAM
shared_buffers = 12GB
work_mem = 64MB
maintenance_work_mem = 64MB
bgwriter_lru_maxpages = 500
checkpoint_segments = 64
checkpoint_completion_target = 0.9
effective_cache_size = 10GB

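I have run VACUUM FULL and ANALYZE on table T2 several times, but that did not improve the situation much. PS: if I set full_page_writes to off, the update query improves considerably, but I don't want to risk data loss. Do you have any suggestions?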


2012-07-08 datatanger

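Try using MERGE instead. It can join the tables faster. – Samson 2012-07-08 09:15:33

Do you really need the ~ operator? What is in the stxt1 and stxt2 fields, and what are their types? – wildplasser 2012-07-08 10:15:50

@wildplasser The ~ operator is almost equivalent to `stxt2 like '%' || stxt1 || '%'`. Both stxt fields are character varying. @radashk I tried this [link](http://petereisentraut.blogspot.com/2010/05/merge-syntax.html), but Postgres always tells me ERROR: syntax error at or near "MERGE". How should I try MERGE? – datatanger 2012-07-08 10:27:36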
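To illustrate the "almost equivalent" in the comment above, here is a small sketch on literal strings (the sample values are made up for illustration). The two operators agree on plain substrings but differ as soon as the pattern contains a regular-expression metacharacter such as the dot:

```sql
-- Both columns are true: the pattern occurs as a substring.
SELECT 'http://www.myhost.com/index.php' ~ 'myhost.com'                  AS regex_match,
       'http://www.myhost.com/index.php' LIKE '%' || 'myhost.com' || '%' AS like_match;

-- Here they differ: in a regular expression '.' matches any character,
-- while LIKE treats it as a literal dot.
SELECT 'http://www.myhostXcom/index.php' ~ 'myhost.com'                  AS regex_match,  -- true
       'http://www.myhostXcom/index.php' LIKE '%' || 'myhost.com' || '%' AS like_match;   -- false
```

As for MERGE: PostgreSQL 9.1 has no MERGE statement (it was only added in PostgreSQL 15), which explains the syntax error. UPDATE ... FROM, as used in the question, is the usual way to write a join update in this version.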

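This is not a solution, but a data-modelling workaround:

Split the URLs into {protocol, hostname, pathname} components.
Now you can join on the hostname part using an exact match, avoiding the leading % in the regex match.
The view is there to demonstrate that full_url can be reconstructed if needed.

The update could take a few minutes.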

SET search_path='tmp'; 

DROP TABLE IF EXISTS urls CASCADE; 
CREATE TABLE urls 
     (id SERIAL NOT NULL PRIMARY KEY 
     , full_url varchar 
     , proto varchar 
     , hostname varchar 
     , pathname varchar 
     ); 

INSERT INTO urls(full_url) VALUES 
('ftp://www.myhost.com/secret.tgz') 
,('http://www.myhost.com/robots.txt') 
,('http://www.myhost.com/index.php') 
,('https://www.myhost.com/index.php') 
,('http://www.myhost.com/subdir/index.php') 
,('https://www.myhost.com/subdir/index.php') 
,('http://www.hishost.com/index.php') 
,('https://www.hishost.com/index.php') 
,('http://www.herhost.com/index.php') 
,('https://www.herhost.com/index.php') 
     ; 

UPDATE urls 
SET proto = split_part(full_url, '://' , 1) 
     , hostname = split_part(full_url, '://' , 2) 
     ; 

UPDATE urls 
SET pathname = substr(hostname, 1+strpos(hostname, '/')) 
     , hostname = split_part(hostname, '/' , 1) 
     ; 

     -- the full_url field is now redundant: we can drop it 
ALTER TABLE urls 
     DROP column full_url 
     ; 
     -- and we could always reconstruct the full_url from its components. 
CREATE VIEW vurls AS (
     SELECT id 
     , proto || '://' || hostname || '/' || pathname AS full_url 
     , proto 
     , hostname 
     , pathname 
     FROM urls 
     ); 

SELECT * FROM urls; 
SELECT * FROM vurls; 

OUTPUT:

INSERT 0 10 
UPDATE 10 
UPDATE 10 
ALTER TABLE 
CREATE VIEW 
 id | proto |    hostname     |     pathname
----+-------+-----------------+------------------
  1 | ftp   | www.myhost.com  | secret.tgz
  2 | http  | www.myhost.com  | robots.txt
  3 | http  | www.myhost.com  | index.php
  4 | https | www.myhost.com  | index.php
  5 | http  | www.myhost.com  | subdir/index.php
  6 | https | www.myhost.com  | subdir/index.php
  7 | http  | www.hishost.com | index.php
  8 | https | www.hishost.com | index.php
  9 | http  | www.herhost.com | index.php
 10 | https | www.herhost.com | index.php
(10 rows)

 id |                full_url                 | proto |    hostname     |     pathname
----+-----------------------------------------+-------+-----------------+------------------
  1 | ftp://www.myhost.com/secret.tgz         | ftp   | www.myhost.com  | secret.tgz
  2 | http://www.myhost.com/robots.txt        | http  | www.myhost.com  | robots.txt
  3 | http://www.myhost.com/index.php         | http  | www.myhost.com  | index.php
  4 | https://www.myhost.com/index.php        | https | www.myhost.com  | index.php
  5 | http://www.myhost.com/subdir/index.php  | http  | www.myhost.com  | subdir/index.php
  6 | https://www.myhost.com/subdir/index.php | https | www.myhost.com  | subdir/index.php
  7 | http://www.hishost.com/index.php        | http  | www.hishost.com | index.php
  8 | https://www.hishost.com/index.php       | https | www.hishost.com | index.php
  9 | http://www.herhost.com/index.php        | http  | www.herhost.com | index.php
 10 | https://www.herhost.com/index.php       | https | www.herhost.com | index.php
(10 rows)


2012-07-08 14:50:52 wildplasser

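Now try running these two updates to get the {protocol, hostname, pathname} components into a temp table. Don't drop the full_url field. – wildplasser 2012-07-08 16:21:40

I have an answer to post, but I have to wait 27 minutes :). I am new and don't have enough reputation – datatanger 2012-07-08 16:35:20

BTW: you could use the above temp table as a junction table, translating full_url to the (canonical?) hostname. – wildplasser 2012-07-08 16:38:15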
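The junction-table idea from the last comment could look roughly like the sketch below. The table names t1_demo, t2_demo and urls_demo and the sample rows are hypothetical miniature stand-ins for T1, T2 and the urls table, and the sketch assumes every URL carries a protocol prefix:

```sql
-- Hypothetical miniature versions of T1, T2 and the urls lookup table.
CREATE TEMP TABLE t1_demo   (vid integer, stxt1 varchar);
CREATE TEMP TABLE t2_demo   (stxt2 varchar, vid integer);
CREATE TEMP TABLE urls_demo (full_url varchar, hostname varchar);

INSERT INTO t1_demo VALUES (7, 'www.myhost.com');
INSERT INTO t2_demo VALUES ('http://www.myhost.com/index.php', NULL),
                           ('http://www.other.org/index.php', NULL);

-- Fill the lookup table once per distinct URL, extracting the hostname with split_part.
INSERT INTO urls_demo (full_url, hostname)
SELECT DISTINCT stxt2, split_part(split_part(stxt2, '://', 2), '/', 1)
FROM t2_demo;

-- Both join conditions are now plain equalities instead of a regex match.
UPDATE t2_demo
SET vid = t1_demo.vid
FROM urls_demo u
JOIN t1_demo ON t1_demo.stxt1 = u.hostname
WHERE t2_demo.stxt2 = u.full_url
  AND t2_demo.vid IS NULL;

-- The myhost row now has vid = 7; the other row stays NULL.
SELECT stxt2, vid FROM t2_demo ORDER BY stxt2;
```

Keeping full_url in the lookup table, as the comment asks, is what makes the second equality join back to T2 possible.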

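Thank you, this helps. So here is what I did:

I created the urls table as you described
I added a vid column of type integer to it
I inserted a million rows from T2 into the full_url column
I enabled timing and updated the hostname column for the full_url values that contain neither 'http' nor 'www':

update urls set hostname=full_url where full_url not like '%/%' and full_url not like 'www\.%';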

Time: 112435.192 ms

Then I ran this query:

mydb=> explain analyse update urls set vid=T1.vid from T1 where hostname=stxt1; 
      QUERY PLAN               
      ----------------------------------------------------------------------------------------------------------------------------- 
      Update on urls (cost=21.93..37758.76 rows=864449 width=124) (actual time=767.793..767.793 rows=0 loops=1) 
       -> Hash Join (cost=21.93..37758.76 rows=864449 width=124) (actual time=102.324..430.448 rows=94934 loops=1) 
          Hash Cond: ((urls.hostname)::text = (T1.stxt1)::text) 
          -> Seq Scan on urls (cost=0.00..25612.52 rows=927952 width=114) (actual time=0.009..265.962 rows=927952 loops=1) 
          -> Hash (cost=15.30..15.30 rows=530 width=34) (actual time=0.444..0.444 rows=530 loops=1) 
             Buckets: 1024 Batches: 1 Memory Usage: 35kB 
             -> Seq Scan on T1 (cost=0.00..15.30 rows=530 width=34) (actual time=0.002..0.181 rows=530 loops=1) 
      Total runtime: 767.860 ms

I was really surprised by the total runtime: less than 1 second! This confirms what you said about updates with an exact match.

mydb=> select count(*) from T2 where vid is null and exists(select null from T1 where stxt1=stxt2); 
count 
-------- 
308486 
(1 row)

So I tried the update on table T2 and got this:

mydb=> explain analyse update T2 set vid = T1.vid from T1 where T2.vid is null and stxt2=stxt1; 
                                  QUERY PLAN                
--------------------------------------------------------------------------------------------------------------------------------------- 
Update on T2 (cost=21.93..492023.13 rows=2106020 width=131) (actual time=252395.118..252395.118 rows=0 loops=1) 
    -> Hash Join (cost=21.93..492023.13 rows=2106020 width=131) (actual time=1207.897..4739.515 rows=308486 loops=1) 
       Hash Cond: ((T2.stxt2)::text = (T1.stxt1)::text) 
       -> Seq Scan on T2 (cost=0.00..455452.09 rows=4130377 width=121) (actual time=158.773..3915.379 rows=4103865 loops=1) 
          Filter: (vid IS NULL) 
       -> Hash (cost=15.30..15.30 rows=530 width=34) (actual time=0.293..0.293 rows=530 loops=1) 
          Buckets: 1024 Batches: 1 Memory Usage: 35kB 
          -> Seq Scan on T1 (cost=0.00..15.30 rows=530 width=34) (actual time=0.005..0.121 rows=530 loops=1) 
Total runtime: 252395.204 ms 
(9 rows) 

Time: 255389.704 ms

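Actually, 255 seconds seems like a very good time for such a query. Now I am looking for a way to make stxt1 and stxt2 match exactly like this. I will try to extract the hostname part from all the URLs and do the update. I should still make sure that updates using an exact match are fast, because I have had bad experiences with them.

Thank you for your support.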
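The hostname extraction mentioned above might be sketched as follows. The sample values are made up, and the CASE branch is an assumption to cover values stored without a protocol prefix (the earlier update suggests such rows exist):

```sql
-- Strip an optional 'proto://' prefix, then keep everything before the first '/'.
SELECT u AS raw_value,
       split_part(
           CASE WHEN u LIKE '%://%' THEN split_part(u, '://', 2) ELSE u END,
           '/', 1
       ) AS hostname
FROM (VALUES ('http://www.myhost.com/index.php'),
             ('www.myhost.com/robots.txt'),
             ('myhost.com')) AS t(u);
-- hostname: www.myhost.com, www.myhost.com, myhost.com
```

Whether 'www.myhost.com' and 'myhost.com' should count as the same host depends on how the values in T1 are written, so some further normalisation may be needed before the exact-match join.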


2012-07-08 17:02:20 datatanger

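Here is an extended example of my earlier comment about functional indexes. If you use PostgreSQL and don't know what a functional index is, you are probably suffering because of it.

Let's create a test table and put some data in it: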

smarlowe=# create table test (a text, b text, c int); 
smarlowe=# insert into test select 'abc','z',0 from generate_series(1,1000000); -- 1 million rows that don't match 
smarlowe=# insert into test select 'abc','a',0 from generate_series(1,10); -- 10 rows that do match 
smarlowe=# insert into test select 'abc','z',1 from generate_series(1,1000000); -- another million rows that won't match.

Now let's run some test queries on it:

\timing 
select * from test where a ~ b and c=0; -- ignore how long this takes 
select * from test where a ~ b and c=0; -- run it twice to get a read with cached data.

On my laptop this takes about 750 ms. This classic index on c:

smarlowe=# create index test_c on test(c); 
smarlowe=# select * from test where a ~ b and c=0;

takes ~400 ms on my laptop.

This functional index though:

smarlowe=# drop index test_c ; 
smarlowe=# create index test_regex on test (c) where (a~b); 
smarlowe=# select * from test where a ~ b and c=0;

Now it runs in 1.3 ms.

Of course, there is no such thing as a free lunch: you will pay for this index during updates and inserts.
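One way to see why the index with the WHERE clause is cheap to keep is to compare index sizes and look at the plan. The sketch below uses a hypothetical, smaller table named test_demo so that it can be run on its own; sizes and plans will vary by machine and version:

```sql
-- Hypothetical scaled-down copy of the test table from this answer.
CREATE TEMP TABLE test_demo (a text, b text, c int);
INSERT INTO test_demo SELECT 'abc', 'z', 0 FROM generate_series(1, 100000);  -- rows that don't match
INSERT INTO test_demo SELECT 'abc', 'a', 0 FROM generate_series(1, 10);      -- rows that do match

CREATE INDEX test_demo_c     ON test_demo (c);                -- indexes every row
CREATE INDEX test_demo_regex ON test_demo (c) WHERE (a ~ b);  -- indexes only the matching rows
ANALYZE test_demo;

-- The index restricted by WHERE should be far smaller than the plain one.
SELECT pg_size_pretty(pg_relation_size('test_demo_c'))     AS plain_index,
       pg_size_pretty(pg_relation_size('test_demo_regex')) AS restricted_index;

-- Show which index, if any, the planner chooses for the test query.
EXPLAIN SELECT * FROM test_demo WHERE a ~ b AND c = 0;
```

Rows that do not satisfy a ~ b never enter the second index, so inserts and updates of such rows do not have to maintain it.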


2012-07-11 21:01:42

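Of course there is a cost, but a partial index with a selective WHERE clause, as in your example, is quite cheap and results in a tiny index. Very useful. – 2012-07-11 21:21:54

@Scott Marlowe Thanks for the tip. So functional indexes are partial indexes? I think I have read about them in the official documentation, but I have never used them. I don't know, but I think they are only useful in some specific cases. – datatanger 2012-07-12 08:58:21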
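On the terminology question: in PostgreSQL these are two separate features. An index on an expression (often called a functional index) stores the result of an expression for every row, while a partial index stores a column only for the rows that satisfy its WHERE predicate; the index in the answer above is of the second kind. A minimal sketch on a hypothetical table idx_demo, with made-up index names:

```sql
CREATE TEMP TABLE idx_demo (a text, b text, c int);

-- Expression ("functional") index: indexes lower(a) for every row.
-- It can serve queries such as: SELECT ... WHERE lower(a) = 'abc'
CREATE INDEX idx_demo_lower_a ON idx_demo (lower(a));

-- Partial index: indexes column c, but only for rows where a ~ b holds.
CREATE INDEX idx_demo_c_partial ON idx_demo (c) WHERE (a ~ b);

-- The two features can be combined in a single index.
CREATE INDEX idx_demo_both ON idx_demo (lower(a)) WHERE c = 0;

-- List the definitions PostgreSQL stored for the three indexes.
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'idx_demo';
```

For the planner to consider a partial index, the query's WHERE clause has to imply the index predicate, which is why the test query repeats a ~ b.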
