2011-08-04 33 views
1

Oracle: how to partition data and get the record at every 10%

I have a huge table with billions of records, like this:

ID | H | N | Q | other 
-----+-----+------+-----+-------- 
AAAA | 0 | 7 | Y | ... 
BBBB | 1 | 5 | Y | ... 
CCCC | 0 | 11 | N | ... 
DDDD | 3 | 123 | N | ... 
EEEE | 6 | 4 | Y | ... 

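These four columns are part of an index. What I want to do is build a query that gives me the first row, followed by the rows at 10%, 20%, 30%, 40%, ... so that the query always returns 10 rows, no matter how large the table is (as long as #rows >= 10).

Is this even possible with SQL? If so, how would I do it? What kind of performance characteristics does it have?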

+1

What are you ordering the results by to determine which row sits at the 10% position in the table? Just `ID`?

+0

Yes, ID and N. H is a pre-computed value based only on ID.

Answers

3

One option would be

SELECT id,
       h,
       n,
       q
  FROM (
        SELECT id,
               h,
               n,
               q,
               row_number() over (partition by decile order by id, n) rn
          FROM (
                SELECT id,
                       h,
                       n,
                       q,
                       ntile(10) over (order by id, n) decile
                  FROM your_table
               )
       )
 WHERE rn = 1

There may be a more efficient approach using PERCENTILE_DISC or CUME_DIST that isn't clicking for me at the moment. But this should work.
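To make the behaviour of the NTILE / ROW_NUMBER query concrete, here is a minimal sketch that runs the same shape of query against a small in-memory table. It is an illustration under assumptions, not the Oracle setup from the question: it uses Python's `sqlite3` module (SQLite 3.25 or later is needed for window functions), a hypothetical table named `test` holding the 676 ids AA to ZZ, and constant values for `n` and `q`.

```python
import sqlite3
from string import ascii_uppercase


def build_demo_table(conn):
    """Create a hypothetical demo table with the ids AA..ZZ (676 rows)."""
    conn.execute(
        "CREATE TABLE test (id TEXT PRIMARY KEY, h INTEGER, n INTEGER, q TEXT)"
    )
    ids = [a + b for a in ascii_uppercase for b in ascii_uppercase]
    conn.executemany(
        "INSERT INTO test (id, h, n, q) VALUES (?, ?, 1, 'Y')",
        [(value, position) for position, value in enumerate(ids, start=1)],
    )


# Same structure as the query in the answer: split the ordered rows into
# ten buckets with ntile(10), then keep the first row of each bucket.
DECILE_QUERY = """
SELECT id
  FROM (
        SELECT id,
               row_number() OVER (PARTITION BY decile ORDER BY id, n) AS rn
          FROM (
                SELECT id,
                       n,
                       ntile(10) OVER (ORDER BY id, n) AS decile
                  FROM test
               )
       )
 WHERE rn = 1
 ORDER BY id
"""


def first_row_per_decile(conn):
    """Return the id of the first row in each of the ten buckets."""
    return [row[0] for row in conn.execute(DECILE_QUERY)]


conn = sqlite3.connect(":memory:")
build_demo_table(conn)
print(first_row_per_decile(conn))
```

On this 676-row table the buckets cannot all be the same size; SQLite puts the larger buckets first, which is also what the "Real results" column further down the thread suggests Oracle does. The sketch says nothing about performance on billions of rows, since every row still has to be numbered.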

+0

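I had to accept this, but it was too slow for my needs. I ended up just counting how many rows are in the table and then running queries over 1/1000 of the table at a time, each doing a `max(id) where id > ? and rownum <= ?`. I then took whatever the last id was, plugged it back into the query, and stored every 100th result until I had all 10. It seems to be much faster than this partitioning query. I'll test it tomorrow and get back to you.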
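The stepping approach described in this comment can be sketched roughly as follows. This is a guess at the idea, not the commenter's actual code: it again uses `sqlite3` with a hypothetical `test` table, replaces Oracle's `rownum <= ?` with `ORDER BY id LIMIT ?`, assumes `id` alone is unique, and treats the number of chunks per decile (`chunks_per_decile`, 100 in the comment) as a parameter.

```python
import sqlite3
from string import ascii_uppercase


def build_demo_table(conn):
    """Create a hypothetical demo table with the ids AA..ZZ (676 rows)."""
    conn.execute("CREATE TABLE test (id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO test (id) VALUES (?)",
        [(a + b,) for a in ascii_uppercase for b in ascii_uppercase],
    )


# Highest id among the next `chunk` rows after the last id already seen.
NEXT_CHUNK_MAX = """
SELECT max(id)
  FROM (SELECT id FROM test WHERE id > ? ORDER BY id LIMIT ?)
"""


def approximate_decile_ids(conn, chunks_per_decile=100):
    """Walk the index in small chunks and keep every Nth chunk boundary.

    Returns the smallest id followed by up to nine boundary ids that sit
    at roughly 10%, 20%, ... 90% of the table.
    """
    total = conn.execute("SELECT count(*) FROM test").fetchone()[0]
    chunk = max(1, total // (10 * chunks_per_decile))
    last_id = conn.execute("SELECT min(id) FROM test").fetchone()[0]
    picked = [last_id]
    for step in range(1, 9 * chunks_per_decile + 1):
        next_id = conn.execute(NEXT_CHUNK_MAX, (last_id, chunk)).fetchone()[0]
        if next_id is None:  # ran past the end of the table
            break
        last_id = next_id
        if step % chunks_per_decile == 0:
            picked.append(last_id)
    return picked


conn = sqlite3.connect(":memory:")
build_demo_table(conn)
# The demo table is tiny, so use one chunk per decile instead of 100.
print(approximate_decile_ids(conn, chunks_per_decile=1))
```

Because the chunk size is rounded down, the boundaries drift slightly from the exact NTILE positions, so the result is approximate. Each step is a short index range scan, which is presumably why the commenter found it faster than numbering every row.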

0

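You can use histograms to get this information. The huge downside is that the results will only be approximate, and it's difficult to say how approximate. You'll also need to gather table statistics to refresh the results, but you're probably already doing that. On the positive side, the query to get the results will be very fast. And using statistics instead of a query would be so cool.

Here's a quick demo: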

--Create a table with the IDs AA - ZZ. 
create table test(id varchar2(100), h number, n number, q varchar2(100) 
    ,other varchar2(100)); 

insert into test 
select letter1||letter2 letters, row_number() over (order by letter1||letter2), 1, 1, 1 
from 
    (select chr(65+level-1) letter1 from dual connect by level <= 26) letters1 
    cross join 
    (select chr(65+level-1) letter2 from dual connect by level <= 26) letters2 
; 
commit; 

--Gather stats, create a histogram with 10 buckets on ID, i.e. 11 endpoints (we'll only use the first 10)
begin 
    dbms_stats.gather_table_stats(user, 'TEST', cascade=>true, 
     method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 10 ID'); 
end; 
/

--Getting the values from user_histograms is kinda tricky, especially for varchars. 
--There are problems with rounding, so some of the values may not actually exist. 
-- 
--This query is from Jonathan Lewis: 
-- http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/ 
select 
     endpoint_number, 
     endpoint_number - nvl(prev_endpoint,0) frequency, 
     hex_val, 
     chr(to_number(substr(hex_val, 2,2),'XX')) || 
     chr(to_number(substr(hex_val, 4,2),'XX')) || 
     chr(to_number(substr(hex_val, 6,2),'XX')) || 
     chr(to_number(substr(hex_val, 8,2),'XX')) || 
     chr(to_number(substr(hex_val,10,2),'XX')) || 
     chr(to_number(substr(hex_val,12,2),'XX')), 
     endpoint_actual_value 
from (
     select 
       endpoint_number, 
       lag(endpoint_number,1) over(
         order by endpoint_number 
       )              prev_endpoint, 
       to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val, 
       endpoint_actual_value 
     from 
       user_histograms 
     where table_name = 'TEST' 
     and  column_name = 'ID' 
     ) 
where 
     endpoint_number < 10 
order by 
     endpoint_number 
; 

Here's a comparison of the histogram results with the real results from @Justin Cave's query:

Histogram: Real results: 
[email protected]   AA 
CP   CQ 
FF   FG 
HV   HW 
KL   KM 
NB   NC 
PR   PS 
SG   SH 
UU   UW 
XK   XL
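The off-by-one-letter differences in the comparison are consistent with the rounding problem mentioned in the query comments. The sketch below illustrates the idea in Python under two assumptions that are not confirmed by this thread: that a string endpoint is stored as its first 15 bytes (zero-padded) read as one large number, and that the number keeps only 15 significant decimal digits. The function names and the rounding mode are hypothetical.

```python
from decimal import Context, ROUND_HALF_EVEN


def encode_endpoint(text):
    """Assumed encoding: first 15 bytes, zero-padded, as a big-endian integer."""
    raw = text.encode("ascii")[:15].ljust(15, b"\x00")
    return int.from_bytes(raw, "big")


def round_to_15_digits(value):
    """Keep only 15 significant decimal digits (assumed precision and mode)."""
    ctx = Context(prec=15, rounding=ROUND_HALF_EVEN)
    return int(ctx.create_decimal(value))


def decode_prefix(value, chars=6):
    """Mirror the substr/to_number/chr chain from the query above:
    render the number as 30 hex digits and turn the leading byte pairs
    back into characters."""
    hex_val = format(value, "030x")
    return "".join(
        chr(int(hex_val[i:i + 2], 16)) for i in range(0, 2 * chars, 2)
    )


exact = encode_endpoint("CP")
rounded = round_to_15_digits(exact)
print(repr(decode_prefix(exact, 2)))  # the unrounded value decodes cleanly
print(repr(decode_prefix(rounded)))   # the rounded value may end in odd bytes
```

Under these assumptions only the first few characters of a decoded endpoint can be trusted, and a value that rounds downward can decode to a string just below the real one, so a decoded endpoint may not exist in the table at all.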