2017-04-04

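"BY variables are not properly sorted" error although the data set has been sorted

I am using SAS to process a large data set (> 20 GB). When I run a DATA step, I get "BY variables are not properly sorted ..." even though I sorted the data set by the same variables. When I ran PROC SORT again, SAS even said "Input data set is already sorted, no sorting done". My code is: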

proc sort data=output.TAQ; 
    by market ric date miliseconds descending type order; 
run; 

options nomprint; 

data markers (keep=market ric date miliseconds type order); 
    set output.TAQ; 
    by market ric date; 

    if first.date; 

    * ie do the following once per stock-day; 
    * Make 1-second markers; 

    /*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond; 
    do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/ 
run; 

And the error message is:

ERROR: BY variables are not properly sorted on data set OUTPUT.TAQ. 
RIC=CXR.CCP Date=20160914 Time=13:47:18.125 Type=Quote Price=. Volume=. BidPrice=9.03 BidSize=400 
AskPrice=9.04 AskSize=100 Qualifiers= order=116458952 Miliseconds=49638125 exchange=CCP market=1 
FIRST.market=0 LAST.market=0 FIRST.RIC=0 LAST.RIC=0 FIRST.Date=0 LAST.Date=1 i=. _ERROR_=1 
_N_=43297873 
NOTE: The SAS System stopped processing this step because of errors. 
NOTE: There were 43297874 observations read from the data set OUTPUT.TAQ. 
WARNING: The data set WORK.MARKERS may be incomplete. When this step was stopped there were 
     56770826 observations and 6 variables. 
WARNING: Data set WORK.MARKERS was not replaced because this step was stopped. 
NOTE: DATA statement used (Total process time): 
     real time   1:14.21 
     cpu time   26.71 seconds 

Did you get an error message in the log when you ran PROC SORT? – user667489


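We definitely need to see more of the log. The observation counts are very odd - you have `if first.date`, so markers should be a subset of output.taq, but at the point where processing stopped, ~43.3m obs had been read from output.taq and ~56.8m had been written to work.markers... – keydemographic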


@keydemographic There is an output statement inside the loop, so the obs counts could do all sorts of things. – user667489

Answers

0

If the source data set resides in a database, it may have been sorted under a different collation.

Try the following before your sort:

options sortpgm=sas; 
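
If the data set carries a sort flag from an earlier run, PROC SORT skips the work, which matches the "already sorted" note in the question. A minimal sketch, assuming the flag is stale or was set under a different collating sequence: the FORCE option asks PROC SORT to sort again regardless, and PROC CONTENTS prints the sort information stored in the data set header. Whether this resolves the error here is an assumption; the question does not confirm it.

```sas
/* Use the SAS sort routine rather than a host or DBMS sort. */
options sortpgm=sas;

/* Inspect the sort information recorded in the data set header. */
proc contents data=output.TAQ;
run;

/* FORCE makes PROC SORT sort again even if the data set is flagged as sorted.
   The BY statement is copied from the question. */
proc sort data=output.TAQ force;
    by market ric date miliseconds descending type order;
run;
```

Re-sorting a data set of more than 20 GB takes time and scratch space, so it may be worth reading the PROC CONTENTS output first.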
1

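The error occurs deep into your DATA step, at _N_=43297873. This suggests that PROC SORT worked up to a point but then failed. Without knowing your SAS environment or how OUTPUT.TAQ is stored, it is hard to know the cause.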
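
To see exactly where the order breaks, one option is to compare each observation with the previous one. The sketch below is a hypothetical diagnostic, not part of the original code: the prev_* variables are names introduced for illustration, and the comparisons use ordinary DATA step ordering, which may differ from the collating sequence PROC SORT used.

```sas
/* Hypothetical diagnostic: report the first observation that breaks
   ascending order on market, ric, date. */
data _null_;
    set output.TAQ (keep=market ric date);

    /* LAG returns the value from the previous call; it is called on every
       iteration so that its queue stays in step with the data. */
    prev_market = lag(market);
    prev_ric    = lag(ric);
    prev_date   = lag(date);

    if _n_ > 1 then do;
        if market < prev_market
           or (market = prev_market and ric < prev_ric)
           or (market = prev_market and ric = prev_ric and date < prev_date)
        then do;
            put "Order breaks at observation " _n_
                prev_market= prev_ric= prev_date= market= ric= date=;
            stop;   /* stop at the first break */
        end;
    end;
run;
```

If this step finishes without writing a message, the data are in order under plain DATA step comparison, which would point toward a collation mismatch rather than an incomplete sort.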

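Some people have reported resource problems or file system limitations when sorting large data sets.

SAS FAQ: Sorting Very Large Datasets with SAS (not an official source):

  • When sorting in the WORK folder, you must have free storage equal to 4 times the size of the data set (or 5 times under Unix)

  • You may be running out of RAM

  • You can use the options MSGLEVEL=i and FULLSTIMER to get a fuller picture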

Using options sastraceloc=saslog; may also produce helpful messages.
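
The diagnostic options above could be combined before re-running the sort, roughly as follows. This is a sketch under assumptions: the SASTRACE= setting is an addition not mentioned in the answer (SASTRACELOC= only controls where trace output goes), and tracing only matters if OUTPUT is a SAS/ACCESS library pointing at a database.

```sas
/* Extra notes in the log (MSGLEVEL=I) and detailed resource
   statistics for each step (FULLSTIMER). */
options msglevel=i fullstimer;

/* Assumption: OUTPUT is a DBMS library. SASTRACE= turns tracing on,
   SASTRACELOC=SASLOG sends the trace to the SAS log. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sort data=output.TAQ;
    by market ric date miliseconds descending type order;
run;
```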

Perhaps, instead of sorting, you could break it into a few steps, something like this:

/* Get your market ~ ric ~ date pairs */ 
proc sql; 
    create table market_ric_date as 
    select distinct market, ric, date 
    from output.TAQ 
    /* Possibly an order by clause here on market, ric, date */ 
; quit; 

data millisecond_stuff; 
    set market_ric_date; 
    *Possibly add type/order in this step as well?; 
    do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end; 
run; 

/* Possibly a third step here to add type/order if you need to get from original data source */