2012-10-17 45 views
5

SQL statistical sampling

I'm looking for some genius-level SQL help with a tricky statistics problem.

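What I'm trying to do is pull a statistically balanced sample from an unbalanced set of user profiles. Doing this for a single profile attribute at a time (gender, for example) would be fairly simple. Doing it across several dimensions at once, however, takes some sophistication.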

For the sake of argument, let's say I have this table.

Profile.userID 
Profile.Gender 
Profile.Age 
Profile.Income 

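Suppose I want to pull a mix of profiles out of the pool so that the new sample of users roughly matches all of the following characteristics: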

50% male, 50% female 
30% young, 40% middle age, 40% old 
40% low income, 40% middle income, 20% high income 

Does anyone have any idea how to pull this off?

+1

What's stopping you from drawing one record at random at a time until the sample set meets your requirements? –

+0

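How would I keep it from drifting out of balance? Say I need just one more female record, but pulling that record throws my age and income out of balance...? – tbacos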

+2

30% young, 40% middle age, 40% old != 100%. Is there an overlap between young and middle age in your ranges? –

Answers

3

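What you have is a sampling problem. The key to solving it is to break the data into a separate group for each combination of the three variables. Then, for each group, calculate the product of its marginal probabilities (the values you gave are the marginal probabilities). Finally, normalize across all 18 groups.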

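For example, the Male-Young-Low group gets a value of 0.5 * 0.3 * 0.4 = 0.06. Repeat this for all 18 groups, then normalize to a percentage (that is, divide each value by the sum of all the values, which is 1.1 here because the age shares add up to 110%). The results: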

Gender  Age     Income  Marg  Normalized
Male    Young   Low     0.06  5.5%
Male    Young   Middle  0.06  5.5%
Male    Young   High    0.03  2.7%
Male    Middle  Low     0.08  7.3%
Male    Middle  Middle  0.08  7.3%
Male    Middle  High    0.04  3.6%
Male    Old     Low     0.08  7.3%
Male    Old     Middle  0.08  7.3%
Male    Old     High    0.04  3.6%
Female  Young   Low     0.06  5.5%
Female  Young   Middle  0.06  5.5%
Female  Young   High    0.03  2.7%
Female  Middle  Low     0.08  7.3%
Female  Middle  Middle  0.08  7.3%
Female  Middle  High    0.04  3.6%
Female  Old     Low     0.08  7.3%
Female  Old     Middle  0.08  7.3%
Female  Old     High    0.04  3.6%
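The table can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the name `group_rates` are made up for illustration, and the shares are copied from the question as posted (so age still sums to 1.1, which is why normalizing matters).

```python
from itertools import product

# Target marginal shares, copied from the question (age sums to 1.1 as posted).
marginals = {
    "gender": {"Male": 0.5, "Female": 0.5},
    "age": {"Young": 0.3, "Middle": 0.4, "Old": 0.4},
    "income": {"Low": 0.4, "Middle": 0.4, "High": 0.2},
}

def group_rates(marginals):
    """Return {(gender, age, income): normalized share} for every combination."""
    raw = {}
    for combo in product(*(m.items() for m in marginals.values())):
        labels = tuple(label for label, _ in combo)
        weight = 1.0
        for _, share in combo:
            weight *= share          # product of the marginal probabilities
        raw[labels] = weight
    total = sum(raw.values())        # 1.1 for the shares above
    return {labels: weight / total for labels, weight in raw.items()}

rates = group_rates(marginals)
for labels, rate in rates.items():
    print(labels, f"{rate:.1%}")
```

With marginals that each sum to 1, the total is already 1 and the normalization step changes nothing.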

These then become the sampling rates for each of your groups. Here is pseudo-SQL that actually does the sampling:

with SamplingRates as (
    select 'Male' as gender, 'Young' as age, 'Low' as income, 0.055 as SamplingRate
    union all . . .   -- one row for each of the remaining 17 groups
)
select t.*
from (select t.*,
             -- number the rows of each group in random order
             row_number() over (partition by gender, age, income order by <random>) as seqnum,
             -- size of the group the row belongs to
             count(*) over (partition by gender, age, income) as NumRecs
      from table t
     ) t join
     SamplingRates sr
     on t.gender = sr.gender and t.age = sr.age and t.income = sr.income and
        t.seqnum <= sr.SamplingRate * t.NumRecs
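The same idea can be sketched outside the database. Note that this is a variant, not a translation of the query above: it scales each group's share by a target sample size (the `sample_size` parameter is an assumption of this sketch) rather than by the group's record count, so the result follows the target mix even when the source groups differ widely in size. The row layout (dicts with `gender`, `age`, `income` keys) and the demo data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, rates, sample_size, seed=0):
    """Draw about rates[group] * sample_size rows at random from each group.

    rows:  iterable of dicts with 'gender', 'age', 'income' keys (assumed layout)
    rates: {(gender, age, income): share of the final sample}
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[(row["gender"], row["age"], row["income"])].append(row)

    sample = []
    for group, members in by_group.items():
        quota = round(rates.get(group, 0.0) * sample_size)
        rng.shuffle(members)            # plays the role of "order by <random>"
        sample.extend(members[:quota])  # a group that is too small gives all it has
    return sample

# Tiny demo: two groups with a 25% / 75% target, drawn from a lopsided pool.
pool = (
    [{"userID": i, "gender": "Male", "age": "Young", "income": "Low"} for i in range(900)]
    + [{"userID": 900 + i, "gender": "Female", "age": "Old", "income": "High"} for i in range(100)]
)
demo_rates = {("Male", "Young", "Low"): 0.25, ("Female", "Old", "High"): 0.75}
picked = stratified_sample(pool, demo_rates, sample_size=40)
print(sum(r["gender"] == "Female" for r in picked), "of", len(picked), "are female")
# prints "30 of 40 are female"
```

If a group holds fewer rows than its quota, the sample comes up short for that group, so the largest balanced sample you can draw is limited by the scarcest group.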
0

Here's how I would go about it, assuming: 30% young, 40% middle age, 30% old

Taking a common denominator of the shares, your pool size = 5x5x3x4x2x4 = 2400

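You have 18 queries that fill your pool into a TEMP TABLE. Repeat all 18 queries to get a larger pool. Below is the distribution of the ideal pool and what each query would look like. You can also introduce some randomness into each query; there was an earlier post about doing that.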

This may not be very elegant, but it should produce a balanced pool.

Your first query, in pseudocode, would look like:

SELECT * INTO TEMP TABLE 
WHERE male, young, high income and ID NOT IN TEMP TABLE 
LIMIT RECORD SET 72 

And so on. Hope that helps. Good question, though.
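One of these 18 queries could be sketched as follows, using Python's built-in `sqlite3` module so that it runs as-is. Everything specific here is an assumption for illustration: the temp table name `Pool`, the helper `fill_group`, the made-up rows, and the idea that `Age` and `Income` are already stored as category labels. `ORDER BY RANDOM()` is SQLite's way of adding the randomness mentioned above; other databases spell it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Profile (userID INTEGER PRIMARY KEY, Gender TEXT, Age TEXT, Income TEXT)"
)
# Hypothetical lopsided data: 500 profiles of one group, 200 of another.
conn.executemany(
    "INSERT INTO Profile (Gender, Age, Income) VALUES (?, ?, ?)",
    [("Male", "Young", "High")] * 500 + [("Female", "Old", "Low")] * 200,
)
conn.execute(
    "CREATE TEMP TABLE Pool (userID INTEGER PRIMARY KEY, Gender TEXT, Age TEXT, Income TEXT)"
)

def fill_group(conn, gender, age, income, limit):
    """Copy up to `limit` random, not-yet-pooled profiles of one group into Pool."""
    conn.execute(
        """
        INSERT INTO Pool (userID, Gender, Age, Income)
        SELECT userID, Gender, Age, Income
        FROM Profile
        WHERE Gender = ? AND Age = ? AND Income = ?
          AND userID NOT IN (SELECT userID FROM Pool)
        ORDER BY RANDOM()
        LIMIT ?
        """,
        (gender, age, income, limit),
    )

fill_group(conn, "Male", "Young", "High", 72)
print(conn.execute("SELECT COUNT(*) FROM Pool").fetchone()[0])  # prints 72
```

Calling `fill_group` again for the same group adds another 72 distinct profiles, which corresponds to repeating the queries to grow the pool.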

CREATE TEMP TABLE 
480 high income 
    144 young 
     72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72] 
     72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72] 
    192 middle age 
     96 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96] 
     96 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96] 
    144 old 
     72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72] 
     72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72] 

960 middle income 
    288 young 
     144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144] 
     144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144] 
    384 middle age 
     192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192] 
     192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192] 
    288 old 
     144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144] 
     144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144] 

960 low income 
    288 young 
     144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144] 
     144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144] 
    384 middle age 
     192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192] 
     192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192] 
    288 old 
     144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144] 
     144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
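The per-query limits in the tree above can be double-checked with a short script. It is a sketch under this answer's assumptions (a pool of 2400 and the adjusted 30/40/30 age split); the variable names are illustrative. `Fraction` keeps the arithmetic exact, so the check that every cell is a whole number of rows is reliable.

```python
from fractions import Fraction
from itertools import product

POOL_SIZE = 2400  # pool size used in this answer

# Shares assumed in this answer (old lowered to 30% so that age sums to 100%).
gender = {"male": Fraction(1, 2), "female": Fraction(1, 2)}
age = {"young": Fraction(3, 10), "middle age": Fraction(4, 10), "old": Fraction(3, 10)}
income = {"high": Fraction(2, 10), "middle": Fraction(4, 10), "low": Fraction(4, 10)}

quotas = {}
for (inc, p_inc), (a, p_age), (g, p_gen) in product(
    income.items(), age.items(), gender.items()
):
    cell = p_inc * p_age * p_gen * POOL_SIZE
    assert cell.denominator == 1      # every cell is a whole number of rows
    quotas[(inc, a, g)] = int(cell)   # this is the LIMIT for that group's query

for key, limit in quotas.items():
    print(key, limit)
```

The 18 limits add up to the pool size, so running each query once fills the pool exactly when every group has enough profiles to meet its limit.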