2017-04-25 19 views
0

Select the first N rows where the total length of a text field reaches a certain limit

I have a table like this:

CREATE TABLE cache (
    id BIGSERIAL PRIMARY KEY, 
    source char(2) NOT NULL, 
    target char(2) NOT NULL, 
    q TEXT NOT NULL, 
    result TEXT, 
    profile TEXT NOT NULL DEFAULT '', 
    created TIMESTAMP NOT NULL DEFAULT now(), 
    api_engine text NOT NULL, 
    encoded TEXT NOT NULL 
); 

I want to walk through the list of encoded fields (perhaps with an OVER ... window), with something like:

SELECT id, string_agg(encoded, '&q=') FROM cache 

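That way I would get the list of corresponding ids and a string of the concatenated encoded fields: '&q=encoded1&q=encoded2&q=encoded3' ... with a total length that does not exceed some limit (say, no more than 2000 characters).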

Second condition: I want to move on to the next window whenever one of the fields source, target, or profile changes.

Is this possible with a plain SQL SELECT inside a FOR loop?

I know how to do this procedurally with plpgsql/plpython/plperl, but I want to optimize this request.

FOR rec IN 
    SELECT array_agg(id) AS ids, string_agg(encoded, '&q=') AS url FROM cache 
    WHERE result IS NULL 
    ORDER BY source, target 
LOOP 
    -- here I call curl with that *url* 

Example data:

INSERT INTO cache (id, source, target, q, result, profile, api_engine, encoded) VALUES 
    (1, 'ru', 'en', 'Длинная фраза по-русски'   , NULL, '', 'google', '%D0%94%D0%BB%D0%B8%D0%BD%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8') 
, (2, 'ru', 'es', 'Ещё одна непонятная фраза по-русски', NULL, '', 'google', '%D0%95%D1%89%D1%91+%D0%BE%D0%B4%D0%BD%D0%B0+%D0%BD%D0%B5%D0%BF%D0%BE%D0%BD%D1%8F%D1%82%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8') 
-- etc... 

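And so on, 100500 rows like this. The source and target fields can hold different language codes, and they repeat, so I probably also need something like GROUP BY source, target, profile.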

I want to select the first N rows whose encoded fields, concatenated with some delimiter, form a string like this:

&q=%D0%94%D0%BB%D0%B8%D0%BD%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8&q=%D0%95%D1%89%D1%91+%D0%BE%D0%B4%D0%BD%D0%B0+%D0%BD%D0%B5%D0%BF%D0%BE%D0%BD%D1%8F%D1%82%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8 

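The length of this concatenated string must be no more than 2000 characters. That gives me the string and also the ids of all the rows included in the url (in the same order, of course).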

Then I want to select the next N rows by the same criterion, and so on.

+3

Can you edit your question and provide sample data and the expected results? –

+0

What happens if the length exceeds 2000 characters? –

+0

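If the length exceeds the quota, I push the result to a function that sends it to an HTTP API and receives its slow responses, then start again from the next part of my table. – Dimitri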

Answers

1

You can do this with a smart recursive CTE:

WITH RECURSIVE c AS (-- 1st CTE is not recursive 
    SELECT dense_rank() OVER (ORDER BY  source, target, profile)    AS rnk 
     , row_number() OVER (PARTITION BY source, target, profile ORDER BY id) AS rn 
     , lead(encoded) OVER (PARTITION BY source, target, profile ORDER BY id) AS next_enc 
     , id, encoded 
    FROM cache 
    ) 

, rcte AS ( -- "recursion" starts here 
    SELECT rnk, rn, ARRAY[id] AS ids, encoded AS url 
     , CASE WHEN length(concat_ws('&q=', encoded, next_enc)) > 2000 -- max len
       OR next_enc IS NULL -- last in partition 
       THEN TRUE END AS print 
    FROM c 
    WHERE rn = 1 

    UNION ALL 
    SELECT c.rnk, c.rn 
     , CASE WHEN r.print THEN ARRAY[id] ELSE r.ids || c.id      END AS ids 
     , CASE WHEN r.print THEN c.encoded ELSE concat_ws('&q=', r.url, c.encoded) END AS url 
     , CASE WHEN length(
      CASE WHEN r.print THEN concat_ws('&q=', c.encoded, c.next_enc) 
        ELSE concat_ws('&q=', r.url, c.encoded, c.next_enc) END) > 2000 -- max len 
       OR c.next_enc IS NULL -- last in partition 
       THEN TRUE END AS print 
    FROM rcte r 
    JOIN  c USING (rnk) 
    WHERE c.rn = r.rn + 1 
    ) 
SELECT ids, url 
FROM rcte 
WHERE print 
ORDER BY rnk, rn; 

About an rCTE that includes a non-recursive CTE:

However, this is probably one of the rare cases where looping in a plpgsql function is actually faster.
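The looping alternative could look roughly like the following. This is a minimal sketch, not a tested or benchmarked implementation: the function name `cache_batches`, the parameter `_max_len`, and the output columns `ids` and `url` are assumptions made for illustration. It takes the `result IS NULL` filter from the loop in the question and, like the query above, builds the url without a leading '&q='.

```sql
-- Sketch only: the function name cache_batches, the parameter _max_len and the
-- output columns ids / url are assumptions, not part of the original post.
CREATE OR REPLACE FUNCTION cache_batches(_max_len int DEFAULT 2000)
  RETURNS TABLE (ids bigint[], url text)
  LANGUAGE plpgsql AS
$func$
DECLARE
   r        record;
   _source  cache.source%TYPE;
   _target  cache.target%TYPE;
   _profile cache.profile%TYPE;
BEGIN
   FOR r IN
      SELECT c.id, c.source, c.target, c.profile, c.encoded
      FROM   cache c
      WHERE  c.result IS NULL              -- only rows still waiting for a result
      ORDER  BY c.source, c.target, c.profile, c.id
   LOOP
      -- Emit the pending batch if the (source, target, profile) group changes
      -- or if appending this row would push the url past the limit.
      IF ids IS NOT NULL
         AND ((r.source, r.target, r.profile) IS DISTINCT FROM (_source, _target, _profile)
              OR length(url) + length('&q=') + length(r.encoded) > _max_len)
      THEN
         RETURN NEXT;
         ids := NULL;
         url := NULL;
      END IF;

      ids      := ids || r.id;                       -- starts a new array when ids is NULL
      url      := concat_ws('&q=', url, r.encoded);  -- concat_ws skips a NULL url
      _source  := r.source;
      _target  := r.target;
      _profile := r.profile;
   END LOOP;

   IF ids IS NOT NULL THEN   -- emit the last, partially filled batch
      RETURN NEXT;
   END IF;
END
$func$;

-- Possible call:
-- SELECT * FROM cache_batches(2000);
```

A single row whose encoded value is already longer than the limit is still returned as a batch of its own, so the caller would have to decide how to handle that case.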

More explanation in this related answer:

+0

Thanks Erwin, I had never heard of CTEs. It looks much more complicated than a plain loop. Your query does return duplicates. ids, url: {1,2,3,4,18,19,21,22,23,25,37},'%%% here long string'; {1,2,3,4,18,19,21,22,23,25,37,38},'%%% previous string + some data'; {1,2,3,4,18,19,21,22,23,25,37,38,39},'%%% and so on'; – Dimitri

+0

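@Dimitri: Sorry, I was in a rush and forgot to restart the aggregation after printing. Fixed now. Anyway, this is just a proof of concept. I am almost certain that, for this particular case, looping through the table is faster and simpler. –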