Random sampling without replacement in longitudinal data

My data is longitudinal:

VISIT ID VAR1 
1  001 ... 
1  002 ... 
1  003 ... 
1  004 ... 
... 
2  001 ... 
2  002 ... 
2  003 ... 
2  004 ...

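Our final goal is to select 10% of each visit for testing. I tried using PROC SURVEYSELECT to do simple random sampling (SRS) without replacement, with VISIT as the strata. But the final sample contains duplicate IDs. For example, ID = 001 can be selected in both VISIT = 1 and VISIT = 2.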
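The attempt described above might look roughly like the following. This is only a sketch: the data set names `have` and `srs_sample` and the seed value are assumptions for illustration and are not from the original post.

```sas
/* Assumed input data set: have (one row per VISIT/ID combination) */
proc sort data = have;
    by visit;
run;

/* Stratified SRS without replacement: 10% of the rows within each visit. */
/* Each stratum is sampled independently, so the same ID can be drawn     */
/* in more than one visit.                                                */
proc surveyselect data = have out = srs_sample
                  method = srs samprate = 0.1 seed = 12345;
    strata visit;
run;
```

Because the strata are sampled independently, nothing in this call prevents an ID from appearing in the sample for several visits.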

Is there a way to do this with SURVEYSELECT or another procedure (R is also fine)? Thanks a lot.

Source

2017-09-15 Sailynette Garcia

So you want to take 10% from each visit, but all IDs in the final data set should be unique? – useR

Yes, exactly as you said. –

As long as the IDs are unique within a visit, you can use `ave`: `dat$pickup <- ave(is.numeric(dat$VISIT), dat$VISIT, FUN = function(x) sample(c(TRUE, FALSE), length(x), prob = c(.1, .9), replace = TRUE))`. – lmo

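This is possible with some fairly creative data step programming. The code below uses a greedy approach: it samples from each visit in turn, and only samples IDs that have not already been sampled in an earlier visit. If more than 90% of the IDs in a visit have already been sampled, fewer than 10% of that visit's rows are output. In the extreme case, when every ID in a visit has already been sampled, no rows are output for that visit.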

/*Create some test data*/ 
data test_data; 
    call streaminit(1); 
    do visit = 1 to 1000;
        do id = 1 to ceil(rand('uniform')*1000);
            output;
        end;
    end;
run; 


data sample; 
    /*Create a hash object to keep track of unique IDs not sampled yet*/ 
    if 0 then set test_data; 
    call streaminit(0); 
    if _n_ = 1 then do;
        declare hash h();
        rc = h.definekey('id');
        rc = h.definedata('available');
        rc = h.definedone();
    end;
    /*Find out how many not-previously-sampled ids there are for the current visit*/
    do ids_per_visit = 1 by 1 until(last.visit);
        set test_data;
        by visit;
        if h.find() ne 0 then do;
            available = 1;
            rc = h.add();
        end;
        available_per_visit = sum(available_per_visit,available);
    end;
    /*Read through the current visit again, randomly sampling from the not-yet-sampled ids*/
    samprate = 0.1;
    number_to_sample = round(available_per_visit * samprate,1);
    do _n_ = 1 to ids_per_visit;
        set test_data;
        if available_per_visit > 0 then do;
            rc = h.find();
            if available = 1 then do;
                if rand('uniform') < number_to_sample/available_per_visit then do;
                    available = 0;
                    rc = h.replace();
                    samples_per_visit = sum(samples_per_visit,1);
                    output;
                    number_to_sample = number_to_sample - 1;
                end;
                available_per_visit = available_per_visit - 1;
            end;
        end;
    end;
run; 

/*Check that there are no duplicate IDs*/ 
proc sort data = sample out = sample_dedup nodupkey; 
by id; 
run;
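
Because the greedy approach can return fewer than 10% of a visit's rows once many IDs have been used up, it may be worth checking the achieved rate per visit as well as ID uniqueness. The following is a minimal sketch, assuming the `test_data` and `sample` data sets created above; the table name `rate_check` and the column names `n_ids`, `n_sampled`, `achieved_rate`, `n_rows` and `n_unique_ids` are made up for illustration.

```sas
proc sql;
    /* If every sampled ID is unique, the two counts are equal */
    select count(*)           as n_rows,
           count(distinct id) as n_unique_ids
    from sample;

    /* Achieved sampling rate per visit; visits with no sampled rows show 0 */
    create table rate_check as
    select t.visit,
           t.n_ids,
           coalesce(s.n_sampled, 0) as n_sampled,
           calculated n_sampled / t.n_ids as achieved_rate format = percent8.1
    from (select visit, count(*) as n_ids
          from test_data
          group by visit) as t
         left join
         (select visit, count(*) as n_sampled
          from sample
          group by visit) as s
         on t.visit = s.visit
    order by t.visit;
quit;
```

Visits whose `achieved_rate` is well below 10% are the ones where most IDs had already been sampled in earlier visits.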

Source

2017-10-12 15:23:20 user667489
