2013-10-03 53 views
0

SAS: filling in missing values by block of data

Say I have the following dataset:

Min Rank Qty 
2 1 100 
2 2 90 
2 3 80 
2 4 70 
5 1 110 
5 2 100 
5 3 90 
5 4 80 
5 5 70 
7 1 120 
7 2 110 
7 3 100 
7 4 90 

I need a dataset in which the minutes are consecutive, like this:

Min Rank Qty 
2 1 100 
2 2 90 
2 3 80 
2 4 70 
3 1 100 
3 2 90 
3 3 80 
3 4 70 
4 1 100 
4 2 90 
4 3 80 
4 4 70 
5 1 110 
5 2 100 
5 3 90 
5 4 80 
5 5 70 
6 1 110 
6 2 100 
6 3 90 
6 4 80 
6 5 70 
7 1 120 
7 2 110 
7 3 100 
7 4 90 

How can I do this in SAS? I just need to copy the previous minute. The number of observations per minute varies: it can be 4 or 5 or more.
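For anyone who wants to try this out, the sample table above can be loaded with a short DATA step. This is only a sketch: the dataset name YOUR_DATA is an assumption chosen to match the placeholder in the first answer, and the column names min, rank and qty are taken from the table header.

```sas
/* Sample data from the question; YOUR_DATA is an assumed placeholder name */
data YOUR_DATA;
    input min rank qty;
    datalines;
2 1 100
2 2 90
2 3 80
2 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
7 1 120
7 2 110
7 3 100
7 4 90
;
run;
```

The rows are already sorted by min and rank, which BY-group processing on min relies on.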

+3

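I don't see any way to know whether a missing minute should have 4 or 5 observations. And how do you know which qty values to start a missing minute with? Do you just copy the previous minute? – mvherweg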

+0

Yes, you're right... just copy the previous minute – Plug4

Answers

0

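It is not hard to imagine code that does this; the problem is that it quickly starts to look messy.

If your dataset is not too large, one approach you could consider is the following: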

/* Find all gaps. The output dataset is a mapping: for each minute we need
   (current_minute), the minute of existing data (reference_minute) that supplies it */
/* Assumes YOUR_DATA is sorted by min */
data MINUTE_MAPPING (keep=current_minute reference_minute);
    set YOUR_DATA;
    by min;
    retain last_minute;

    if _N_ = 1 then last_minute = min; *the first record defines the first minute;

    if first.min then do;
        /* Find gaps, map them to the last minute of data we have */
        if last_minute+1 < min then do;
            do current_minute=last_minute+1 to min-1;
                reference_minute=last_minute;
                output;
            end;
        end;

        /* For the available data, we map the minute to itself */
        reference_minute=min;
        current_minute=min;
        output;

        *update;
        last_minute=min;
    end;
run;

/* Now we apply our mapping to the data */
*you must use proc sql because it is a many-to-many join, data step merge would give a different outcome;
proc sql;
    create table RESULT as
    select MM.current_minute as min, YD.rank, YD.qty
    from MINUTE_MAPPING as MM
    inner join YOUR_DATA as YD
    on (MM.reference_minute=YD.min)
    order by 1, 2
    ;
quit;

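A more performant approach would involve some trickery with arrays. But I find the approach above more appealing (disclaimer: it was the first thing that came to mind) and easier for others to grasp afterwards (disclaimer: imho).


For good measure, the array approach: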

data RESULT (keep=min rank qty);
    set YOUR_DATA;
    by min;
    retain last_minute;
    array last_data{5} _TEMPORARY_; *qty per rank, increase 5 if a minute can hold more ranks;

    if _N_ = 1 then last_minute = min; *assume that first record really is first minute;

    if first.min then do;
        if last_minute+1 < min then do; *gap found;
            *store data of current record;
            curr_min=min;
            curr_rank=rank;
            curr_qty=qty;

            do current_minute=last_minute+1 to min-1;
                *produce records from array with last available data;
                do iter=1 to 5;
                    min = current_minute;
                    rank = iter;
                    qty = last_data{iter};
                    if qty NE . then output; *to prevent output of 5th element where there are only 4;
                end;
            end;

            *put back values of actual current record before proceeding;
            min=curr_min;
            rank=curr_rank;
            qty=curr_qty;
        end;

        *update and clear the array so no stale ranks survive into the new minute;
        last_minute=min;
        call missing(of last_data{*});
    end;

    *insert data for use on later missing minutes;
    last_data{rank}=qty;

    output; *output actual current data point;
run;

Hope this helps. Please note that I don't have access to a SAS client at the moment, so this untested code may contain some typos.
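Since the code is untested, a quick sanity check on RESULT may be worthwhile. The query below is only a sketch; the alias n_obs is a made-up name. Going by the desired output in the question, minutes 2 to 4 should each show 4 rows, minutes 5 and 6 should show 5, and minute 7 should show 4, with no minute missing in between.

```sas
/* Count the rows produced for each minute; n_obs is a hypothetical alias */
proc sql;
    select min, count(*) as n_obs
    from RESULT
    group by min
    order by min;
quit;
```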

+0

Thanks! I will try your code later and get back to you – Plug4

+0

Woke up this morning and realized I had forgotten to populate the array in the array approach. Updated. – mvherweg

0

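Unless you have a ridiculous number of observations, I think transposing would make this easy.

I don't have access to SAS at the moment, so bear with me (if you can't get it to work, I can test it tomorrow).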

*assumes the input dataset is named data and is sorted by minute and rank;
proc transpose data=data out=data_wide(drop=_NAME_) prefix=obs_;
    by minute;
    id rank;
    var qty;
run;

*sort backwards so you can use lag() to fill in the next minute;
proc sort data=data_wide;
    by descending minute;
run;

data data_wide; set data_wide;
    nextminute = lag(minute);
run;

proc sort data=data_wide;
    by minute;
run;

*output until you get to the next minute;
data data_wide; set data_wide;
    *the last minute has no next minute, so output it exactly once;
    if nextminute = . then output;
    else do until (minute ge nextminute);
        output;
        minute = minute + 1;
    end;
run;

*then you probably want to reverse the transpose;
proc transpose data=data_wide(drop=nextminute)
       out=data_narrow(rename=(col1=qty));
    by minute;
    var obs_:;
run;

*clean up the observation number;
data data_narrow(drop=_NAME_); set data_narrow;
    if qty ne .; *drop ranks that the source minute did not have;
    rank = input(substr(_NAME_,5), 8.);
run;

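Again, I can't test this right now, but it should work.

Someone else may have a clever solution that spares you the reverse sort / lag / forward sort. I feel like I have dealt with this before, but the obvious solution to me right now is to make whatever sort you do beforehand a descending one (the transpose works fine with a descending sort), which saves you the extra sort.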
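The descending-sort idea in the last paragraph could look roughly like this. It is an untested sketch under the same assumptions as the code above (input dataset named data, variables minute, rank and qty); data_desc is a made-up name for the sorted copy.

```sas
/* Sort descending up front; data_desc is a hypothetical intermediate dataset */
proc sort data=data out=data_desc;
    by descending minute rank;
run;

/* PROC TRANSPOSE accepts a descending BY variable */
proc transpose data=data_desc out=data_wide(drop=_NAME_) prefix=obs_;
    by descending minute;
    id rank;
    var qty;
run;

/* Rows already run from the latest minute to the earliest,
   so lag() returns the next minute without a separate reverse sort */
data data_wide;
    set data_wide;
    nextminute = lag(minute);
run;

/* One ascending sort remains before the fill step */
proc sort data=data_wide;
    by minute;
run;
```

From here the fill step and the reverse transpose would proceed as before. This only pays off if the raw data has to be sorted anyway.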