2013-07-06 156 views
0

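Creating a single record from multiple records in SAS

I have a SAS data set named coaches_assistants with the structure shown below. There are always exactly two records per TeamID.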

TeamID  Team_City CoachCode 
123  Durham  Head_242 
123  Durham  Assistant_876 
124  London  Head_876 
124  London  Assistant_922 
125  Bath   Head_667 
125  Bath   Assistant_786 
126  Dover  Head_544 
126  Dover  Assistant_978 
...  ...   .... 

What I'd like to do is create a data set with an additional field named AssistantCode, so that it looks like this:

TeamID  Team_City HeadCode AssistantCode 
123  Durham  242  876 
124  London  876  922 
125  Bath   667  786 
126  Dover  544  978 
...  ...   ...  ... 

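If possible, I'd like to do this in a single DATA step (though I realize I may need a PROC SORT step first). I know how to do this in Python, Ruby, or any traditional scripting language, but I don't know how to do it in SAS.

What is the best way to do this?

Answers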

1

Here are two possible solutions (one using a DATA step, as requested, and the other using PROC SQL):

data have; 
    length TeamID $3 Team_City CoachCode $20; 
    input TeamID $ Team_City $ CoachCode $; 
    datalines; 
123  Durham  Head_242 
123  Durham  Assistant_876 
124  London  Head_876 
124  London  Assistant_922 
125  Bath   Head_667 
125  Bath   Assistant_786 
126  Dover  Head_544 
126  Dover  Assistant_978 
; 
run; 

/* A data step solution */ 
proc sort data=have; 
    by TeamID; 
run; 

data want1(keep=TeamID Team_City HeadCode AssistantCode); 
    /* Define all variables, retain the new ones */ 
    length TeamID $3 Team_City $20 HeadCode $3 AssistantCode $3; 
    retain HeadCode AssistantCode; 
    set have; 
     by TeamID; 
    if CoachCode =: 'Head' 
     then HeadCode = substr(CoachCode,6,3); 
     else AssistantCode = substr(CoachCode,11,3); 
    if last.TeamID; 
run; 

/* An SQL solution */ 
proc sql noprint; 
    create table want2 as 
    select TeamID 
     , max(Team_City) as Team_City 
     , max(CASE WHEN CoachCode LIKE 'Head%' 
        THEN substr(CoachCode,6,3) ELSE ' ' 
       END) LENGTH=3 as HeadCode 
     , max(CASE WHEN CoachCode LIKE 'Assistant%' 
        THEN substr(CoachCode,11,3) ELSE ' ' 
       END) LENGTH=3 as AssistantCode 
    from have 
    group by TeamID; 
quit; 

PROC SQL has the advantage of not requiring you to sort the data beforehand.

+0

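Thanks for the detail. I went with the first approach and it works like a charm. I haven't dug into SAS SQL yet, so I'll look at it later when I get the chance. – Clay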

+0

In the first code example, what does "IF LAST.TeamID" do? – Clay

+1

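When you use a BY statement in DATA step processing, special variables are created automatically to help. Two "dot" variables are created for each variable listed in the statement, "FIRST.variable" and "LAST.variable", which identify the relative position of an observation within its group. 'if LAST.TeamID;' is a "subsetting IF" statement, used here to output only one observation per TeamID. – BellevueBob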
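To see these variables in action, here is a minimal sketch that writes them to the log. It assumes the sorted `have` data set from the answer above; the names `is_first` and `is_last` are made up for illustration.

```sas
/* Sketch: show the FIRST./LAST. flags for each observation.
   Assumes `have` is already sorted by TeamID. */
data _null_;
    set have;
    by TeamID;
    /* FIRST./LAST. variables are temporary, so copy them to print them */
    is_first = first.TeamID;
    is_last  = last.TeamID;
    put TeamID= CoachCode= is_first= is_last=;
run;
```

With two records per team, the first row of each team shows is_first=1 and the second shows is_last=1, which is why the subsetting IF keeps only the second row.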

0

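This assumes you have already sorted the data by TeamID and that the head coach always comes before the assistant. Warning: untested (I really need to get access to SAS again...)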

data want (drop=nc coachcode); 
    set have; 
    length headcode assistantcode $3; 
    retain headcode; 
    by teamid; 
    nc = length(coachcode); 
    if substr(coachcode, 1, 4) = 'Head' then 
     /* the third argument of SUBSTR is a length, so take the last 3 characters */ 
     headcode = substr(coachcode, nc-2, 3); 
    else 
     assistantcode = substr(coachcode, nc-2, 3); 
    if last.teamid; 
run; 
+1

SCAN might be better than SUBSTR :) – Joe

+0

@Joe Thanks, I forgot about that one. –
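Following up on the SCAN suggestion, here is a sketch of the same data step using SCAN instead of SUBSTR. It is untested against the asker's real data and assumes every CoachCode has the form Role_Number with a code of at most three characters; the output name `want_scan` is made up for illustration. Retaining and resetting both new variables also removes the dependence on the head coach coming first.

```sas
/* Sketch: split CoachCode on the underscore with SCAN.
   Assumes `have` is sorted by TeamID and CoachCode looks like Role_Number. */
data want_scan (drop=coachcode);
    set have;
    by teamid;
    length headcode assistantcode $3;
    retain headcode assistantcode;
    /* Reset at the start of each team so values never leak across teams */
    if first.teamid then call missing(headcode, assistantcode);
    if scan(coachcode, 1, '_') = 'Head' then
        headcode = scan(coachcode, 2, '_');
    else
        assistantcode = scan(coachcode, 2, '_');
    /* Keep one row per team, written once both records have been read */
    if last.teamid;
run;
```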

2

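While this can be done in a single data step, I usually find that this kind of problem is better served by PROC TRANSPOSE. It takes less manual coding and is more flexible when something new comes along (say a new "HeadAssistant" value shows up; this would handle it immediately).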

data have; 
length coachcode $25; 
input TeamID  Team_City $ CoachCode $; 
datalines; 
123  Durham  Head_242 
123  Durham  Assistant_876 
124  London  Head_876 
124  London  Assistant_922 
125  Bath   Head_667 
125  Bath   Assistant_786 
126  Dover  Head_544 
126  Dover  Assistant_978 
;;;; 
run; 

data have_t; 
set have; 
id=scan(coachcode,1,'_'); 
val = scan(coachcode,2,'_'); 
keep teamId team_city id val; 
run; 

proc transpose data=have_t out=want(drop=_name_); 
by teamID team_city; 
id id; 
var val; 
run; 
+0

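I like how clean this looks, but when I tried to run it on my data set (120,000+ observations), for some reason it did not produce the "want" table I expected. I'll try again later. Thanks! – Clay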
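Without the log it is hard to say what went wrong there. Two things that commonly trip up this approach, offered as guesses rather than a diagnosis: PROC TRANSPOSE with a BY statement expects the input to be sorted by the BY variables, and it reports an error if the same ID value appears twice within one BY group. A sketch that guards against the first, reusing the `have_t` data set from the answer above:

```sas
/* Sketch: sort before transposing so the BY statement is satisfied.
   have_t is the intermediate data set built in the answer above. */
proc sort data=have_t;
    by teamID team_city;
run;

proc transpose data=have_t out=want(drop=_name_);
    by teamID team_city;
    id id;
    var val;
run;
```

If duplicate ID values are the cause instead, the SAS log should say so, and the input would need to be deduplicated before transposing.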