2016-01-25

How do I eliminate duplicate entries in a SAS dataset?

I have been given a sample dataset in CSV format. The dummy dataset looks like this:

Baseball1,Baseball2 
USA,France 
USA,Italy 
USA,England 
England,USA 
England,Australia 
England,Sri Lanka 
France,USA 
France,England 
France,Italy 
Italy,USA 
Italy,France 
Italy,England 

I need an output dataset that contains only distinct values. The desired output looks like this:

Baseball1  Baseball2
USA        France
USA        Italy
USA        England
England    Australia
England    Sri Lanka
France     England
France     Italy
Italy      England

I think PROC SQL would work here, but I don't know how to remove duplicate entries that span different columns.

+0

Interesting question!

Answers

2

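I think the tricky part is that the duplicates span the two columns: for you, France/Italy and Italy/France form a duplicate that you want to remove, while the rows that remain keep their original order.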

See the comments in my code below for what it does:

/* Reading data in */ 
data have;
    length baseball1 $ 9 baseball2 $ 9;
    infile datalines delimiter=',';
    input Baseball1 $ Baseball2 $;
    datalines;
USA,France 
USA,Italy 
USA,England 
England,USA 
England,Australia 
England,Sri Lanka 
France,USA 
France,England 
France,Italy 
Italy,USA 
Italy,France 
Italy,England 
; 

/* horizontal sorting */ 
data sorted_arrays;
    set have;
    length Team1 $ 9 Team2 $ 9;
    /* Copying data into new vars to preserve original data for output */
    Team1 = Baseball1;
    Team2 = Baseball2;
    /* Sorting data horizontally with the CALL SORTC routine */
    call sortc(Team1, Team2);
    /* Creating an ID by concatenating sorted variables */
    ID = catx("/", Team1, Team2);
    /* Preserving original order */
    order = _N_;
run;

/* Removing duplicates by ID and keeping required variables */
PROC SORT data=sorted_arrays out=no_dupes(keep=baseball1 baseball2 order) NODUPKEY;
    BY ID;
RUN;

/* Returning to original order to achieve the result needed */
PROC SORT data=no_dupes out=want(drop=order);
    by order;
run;

/* Final Report */
PROC PRINT data=want;
RUN;

If the final horizontal/vertical order of the values doesn't matter, the code can be simplified as shown below, and you can use PROC SQL:

/* Reading data in */ 
data have;
    length baseball1 $ 9 baseball2 $ 9;
    infile datalines delimiter=',';
    input Baseball1 $ Baseball2 $;
    /* horizontal sorting */
    call sortc(Baseball1, Baseball2);
    datalines;
USA,France 
USA,Italy 
USA,England 
England,USA 
England,Australia 
England,Sri Lanka 
France,USA 
France,England 
France,Italy 
Italy,USA 
Italy,France 
Italy,England 
; 

/*Remove dupes */ 
PROC SQL; 
    CREATE TABLE want AS 
    SELECT DISTINCT t1.baseball1, 
      t1.baseball2 
     FROM WORK.HAVE t1; 
QUIT; 


/* Final Report*/ 
PROC PRINT data=want; 
RUN; 

If you would rather not use CALL SORTC, you can swap the two values yourself with an IF block:

/* Reading data in */
data have (drop=tmp); 
    length baseball1 $ 9 baseball2 $ 9 tmp $9; 
    infile datalines delimiter=','; 
    input Baseball1 $ Baseball2 $; 

    /* horizontal sorting */ 
    if Baseball1>Baseball2 then 
     do; 
      tmp = Baseball1; 
      Baseball1=Baseball2; 
      Baseball2 = tmp; 
     end; 

    datalines; 
USA,France 
USA,Italy 
USA,England 
England,USA 
England,Australia 
England,Sri Lanka 
France,USA 
France,England 
France,Italy 
Italy,USA 
Italy,France 
Italy,England 
; 

/*Remove dupes */ 
PROC SQL; 
    CREATE TABLE want AS 
     SELECT DISTINCT t1.baseball1, 
      t1.baseball2 
     FROM WORK.HAVE t1; 
QUIT; 

/* Final Report*/ 
PROC PRINT data=want; 
RUN; 

This gives the same result as the previous example.

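For a variant that stays entirely in PROC SQL, as the question hoped for, the swap can probably be expressed with CASE expressions instead of a DATA step. This is only a minimal sketch, assuming a WORK.HAVE dataset with the two $9 character columns read in as above; the names Team1, Team2 and want_sql are illustrative, and the comparison is case-sensitive.

```sas
/* Sketch: order each pair inside PROC SQL, then keep distinct pairs.     */
/* Assumes WORK.HAVE with character columns Baseball1 and Baseball2 ($9). */
/* Team1, Team2 and want_sql are illustrative names.                      */
proc sql;
    create table want_sql as
    select distinct
        /* smaller value of the pair goes to the first column */
        case when Baseball1 <= Baseball2 then Baseball1 else Baseball2 end
            as Team1 length=9,
        /* larger value of the pair goes to the second column */
        case when Baseball1 <= Baseball2 then Baseball2 else Baseball1 end
            as Team2 length=9
    from have;
quit;

proc print data=want_sql;
run;
```

As with the simplified examples above, the original row order and the original left/right placement of each pair are not preserved.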

+0

Well, I don't care about the order at all. It can be France/USA or USA/France; either one is fine. Thanks

+0

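You put a desired dataset in your question that preserves the order both horizontally and vertically. If you don't care about the order, this code will still work for you, and you can simplify it by removing about half of it. I'll amend my answer to show what I mean.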

+0

What if I don't want to use CALL SORTC?