2016-03-11

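Fetching a specific number of records to satisfy some overall conditions

I'm describing an abstract case here, but it is similar to the one I'm trying to solve right now. I know how to get a rough result with a PL/SQL block, but I wonder whether someone can come up with a solution that uses a single select query.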

Suppose we have a table t_people with thousands of records describing some group of people, with the following set of attributes:

  • id
  • age, number
  • height in centimeters, number
  • gender, varchar2 ('male' or 'female')

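We need to extract N records such that the result set satisfies the following conditions:

  • 30% of the selected people are taller than 180 cm
  • 60% of the selected people are male
  • 40% of the selected people are older than 40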

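We can also assume that N is much smaller than the total number of rows in the table, so the problem is solvable.

How would you suggest doing this with a single select query?

Thanks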

+0

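You could use a UNION ALL of 8 queries -- the first returns N * .3 * .6 * .4 people who are taller than 180 cm, male, and over 40, the next returns N * .3 * .6 * .6 people who are taller than 180 cm, male, and 40 or younger, and so on? –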
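The arithmetic behind that comment can be sketched in Python. This is a minimal illustration, not part of the original thread: the function name `cell_sizes` is hypothetical, and multiplying the three shares assumes the attributes are treated as independent, which is what the N * .3 * .6 * .4 expression implies.

```python
from itertools import product


def cell_sizes(n, p_tall=0.3, p_male=0.6, p_old=0.4):
    """Return the target row count for each of the 8 (tall, male, old) cells.

    Assumption: the share of a cell is the product of the three marginal
    shares, as in the N * .3 * .6 * .4 expression from the comment.
    """
    sizes = {}
    for tall, male, old in product((True, False), repeat=3):
        share = ((p_tall if tall else 1 - p_tall)
                 * (p_male if male else 1 - p_male)
                 * (p_old if old else 1 - p_old))
        sizes[(tall, male, old)] = round(n * share)
    return sizes


sizes = cell_sizes(1000)
for cell, count in sizes.items():
    print(cell, count)
```

Each of the 8 UNION ALL branches would then be limited to the count of its cell. For values of N where the products are not whole numbers, rounding can make the total differ slightly from N.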

Answers

3

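You would stratify the data into 8 groups and then take a proportional sample from each group to meet your requirements. A crude approach is to translate the conditions into groups, say, for N = 1000:

  • 300 people who are taller than 180, not male, and not older than 40
  • 100 people who are shorter, not male, and not older than 40
  • 400 people who are shorter, male, and older than 40
  • 200 people who are shorter, male, and not older than 40

Then you can solve for this: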

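with p as (
     select p.*,
            -- number the rows inside each (is_tall, is_male, is_old) cell;
            -- order by dbms_random.value instead of id for a random sample
            row_number() over (partition by is_tall, is_male, is_old order by id) as seqnum
     from (select p.*,
                  -- the flags get their own names so that they do not collide
                  -- with the height and age columns already returned by p.*
                  (case when height > 180 then 1 else 0 end) as is_tall,
                  (case when gender = 'male' then 1 else 0 end) as is_male,
                  (case when age > 40 then 1 else 0 end) as is_old
           from t_people p
          ) p
    )
select p.*
from p
where (is_tall = 1 and is_male = 0 and is_old = 0 and seqnum <= 300) or
      (is_tall = 0 and is_male = 0 and is_old = 0 and seqnum <= 100) or
      (is_tall = 0 and is_male = 1 and is_old = 1 and seqnum <= 400) or
      (is_tall = 0 and is_male = 1 and is_old = 0 and seqnum <= 200);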
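As a quick sanity check, the four seqnum limits used in the query (300, 100, 400, 200) can be added up per attribute. The sketch below is only an illustration in Python; the dictionary layout and variable names are hypothetical, and the flags are ordered as (tall, male, old).

```python
# Group sizes from the query, keyed by (tall, male, old) flags.
groups = {
    (1, 0, 0): 300,  # taller than 180, not male, not older than 40
    (0, 0, 0): 100,  # shorter, not male, not older than 40
    (0, 1, 1): 400,  # shorter, male, older than 40
    (0, 1, 0): 200,  # shorter, male, not older than 40
}

total = sum(groups.values())
tall_count = sum(n for (tall, _, _), n in groups.items() if tall)
male_count = sum(n for (_, male, _), n in groups.items() if male)
old_count = sum(n for (_, _, old), n in groups.items() if old)

print(f"total={total}")
print(f"tall={tall_count / total:.0%} male={male_count / total:.0%} old={old_count / total:.0%}")
```

The totals come out to 1000 rows with 300 tall, 600 male and 400 older than 40, so four of the eight cells are enough to hit all three targets. Other allocations satisfy the same targets, which is why the answer calls this a crude approach.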

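There is another approach you could use, where you fill the 8 buckets evenly while keeping track of the counts along each dimension (younger/older, male/female, shorter/taller). When the first dimension is filled, you stop filling it and continue filling the 4 complementary cells. Repeat this process until you have the desired numbers.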
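One way to read this second approach is as greedy quota sampling. The sketch below is an interpretation written in Python rather than SQL, not the answerer's own code: the function `quota_sample`, the dictionary-based rows and the synthetic `population` are all hypothetical. A row is accepted only while every one of its attribute values still has quota left, so once one side of a dimension is full, only the complementary cells keep filling.

```python
def quota_sample(people, n, quotas):
    """Greedy quota sampling (an interpretation of the bucket-filling idea).

    people: iterable of dicts with boolean values for each key in quotas
    quotas: target share of True values per attribute, e.g. {"tall": 0.3}
    """
    remaining = {}
    for dim, share in quotas.items():
        want_true = round(n * share)
        remaining[(dim, True)] = want_true
        remaining[(dim, False)] = n - want_true

    picked = []
    for person in people:
        keys = [(dim, person[dim]) for dim in quotas]
        # Accept the row only if none of its attribute values is already full.
        if all(remaining[key] > 0 for key in keys):
            for key in keys:
                remaining[key] -= 1
            picked.append(person)
            if len(picked) == n:
                break
    return picked


# Synthetic data: every combination of flags occurs once in each run of 8 rows.
population = [
    {"tall": bool(i & 1), "male": bool(i & 2), "old": bool(i & 4)}
    for i in range(10_000)
]
sample = quota_sample(population, 100, {"tall": 0.3, "male": 0.6, "old": 0.4})
print(len(sample), sum(p["tall"] for p in sample))
```

Because the quotas of each dimension add up to N, a sample that reaches N rows meets every target exactly. The sketch assumes every combination of attributes is well represented in the data; if some cell is missing, the loop can finish with fewer than N rows.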

+0

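Thanks a lot, your first approach will work for me. In my real case I have 9 grouping attributes, but I think your idea is still applicable there. As for the second approach, I had thought of something similar but slightly different. First, I would select N rows at random and look at the proportions. While the proportions are not good enough, I would replace bad records with better ones from the remaining set. A record is better if it brings the proportions closer to the desired ones. Again, thanks a lot, and have a nice weekend! –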
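The replace-until-good-enough idea from this comment can be sketched as a simple random search. This is a hedged Python illustration, not the commenter's code: `deviation`, `refine`, `TARGETS`, the step limit and the synthetic `population` are hypothetical, and "better" is taken to mean a smaller sum of gaps between actual and target counts.

```python
import random

TARGETS = {"tall": 0.3, "male": 0.6, "old": 0.4}


def deviation(sample, targets):
    """Sum over attributes of |actual count - target count|."""
    n = len(sample)
    return sum(
        abs(sum(p[dim] for p in sample) - round(n * share))
        for dim, share in targets.items()
    )


def refine(people, n, targets, rng, max_steps=10_000):
    """Start from a random sample, then keep only swaps that reduce the deviation."""
    pool = list(people)
    rng.shuffle(pool)
    sample, rest = pool[:n], pool[n:]
    best = deviation(sample, targets)
    for _ in range(max_steps):
        if best == 0:
            break
        i, j = rng.randrange(len(sample)), rng.randrange(len(rest))
        sample[i], rest[j] = rest[j], sample[i]      # try a swap
        score = deviation(sample, targets)
        if score < best:
            best = score                             # keep the improvement
        else:
            sample[i], rest[j] = rest[j], sample[i]  # undo the swap
    return sample, best


# Synthetic data: every combination of flags is equally common.
population = [
    {"tall": bool(i & 1), "male": bool(i & 2), "old": bool(i & 4)}
    for i in range(10_000)
]
sample, gap = refine(population, 100, TARGETS, random.Random(42))
print(len(sample), gap)
```

The returned `gap` is 0 when all three targets are met exactly. With data in which every combination of attributes is common, an improving swap always exists, so the search tends to converge quickly; on skewed data it may stop at the step limit with a small remaining gap.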

0

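I finally chose the first approach, suggested by Gordon Linoff, with a few small modifications. I kept the original idea but also introduced several additional subqueries that specify the desired distribution of records within the groups, and built a matrix with the required number of records per group. There is also a global parameters section, which contains a single parameter that specifies the overall record count.

The query produces quite useful results: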

with 
    people as (
     select id, 
       floor(months_between(sysdate, date_birth)/12) age, 
       195 - least(floor(months_between(sysdate, date_birth)/12), 50) height, 
       decode(sex, 1, 'male', 'female') gender 
     from my_people_table 
     where date_birth is not null and rownum < 100000 
    ), 
    params as (/* Global params */ 
     select 100 rec_count -- total record count 
     from dual 
    ), 
    age_groups as ( /* distribution by age */ 
     select 'group 1' age_group, .7 prc from dual union 
     select 'group 2' age_group, .3 prc from dual 
    ), 
    height_groups as (/* distribution by height */ 
     select 'group 1' height_group, .6 prc from dual union 
     select 'group 2' height_group, .4 prc from dual 
    ), 
    genders as (  /* distribution by gender */ 
     select 'male' gender, .6 prc from dual union 
     select 'female' gender, .4 prc from dual 
    ), 
    mx as (   /* a matrix with record counts per group */ 
     select age_group, height_group, gender, 
       ceil(
        age_groups.prc * 
        height_groups.prc * 
        genders.prc * 
        rec_count 
       ) rec_count  
     from age_groups, height_groups, genders, params 
    ), 
    xpeople as (  /* Minor transformations - groups and group counters */ 
     select p.*, 
       row_number() over (
        partition by age_group, height_group, gender 
         order by age_group, height_group, gender 
       ) rec_num 
     from (        
       select people.*, 
         case 
          when age <= 40 then 'group 1' 
               else 'group 2' 
         end age_group, 
         case 
          when height <= 180 then 'group 1' 
               else 'group 2' 
         end height_group 
       from people 
     ) p 
    ) 
/* the resulting query uses the matrix to filter the records */  
select xpeople.* 
from xpeople join mx 
      on xpeople.age_group = mx.age_group 
      and xpeople.height_group = mx.height_group  
      and xpeople.gender = mx.gender 
      and xpeople.rec_num <= mx.rec_count 
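One detail of the mx matrix is worth checking: because every cell is rounded up with ceil, the per-cell counts can add up to more than rec_count. The sketch below reproduces the matrix arithmetic in Python with the shares from the query. The variable names are hypothetical, and Decimal is used so that the products behave like exact decimal arithmetic rather than binary floating point.

```python
from decimal import Decimal
from itertools import product
from math import ceil

REC_COUNT = 100  # params.rec_count in the query

# Shares copied from the age_groups, height_groups and genders subqueries.
AGE = {"group 1": Decimal("0.7"), "group 2": Decimal("0.3")}
HEIGHT = {"group 1": Decimal("0.6"), "group 2": Decimal("0.4")}
GENDER = {"male": Decimal("0.6"), "female": Decimal("0.4")}

# Same formula as mx: ceil(age prc * height prc * gender prc * rec_count).
matrix = {
    (a, h, g): ceil(AGE[a] * HEIGHT[h] * GENDER[g] * REC_COUNT)
    for a, h, g in product(AGE, HEIGHT, GENDER)
}
total = sum(matrix.values())

for cell, count in matrix.items():
    print(cell, count)
print("total:", total)
```

With these shares the cell counts sum to 104, so the query can return up to 104 rows for rec_count = 100. If an exact total matters, the rounding step would need a different rule, for example rounding down and handing the leftover rows to the cells with the largest fractional parts.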

Thanks for your help!
