3

Scoring a very large dataset

I have a machine-learning classifier, built in R/Python on a 1-2% sample of the data, and I am quite happy with its accuracy measures (precision, recall and F-score).

Now I would like to score a huge database of 70 million rows with this classifier, which is coded in R.

Some information about the dataset, which resides in a Hadoop/Hive environment:

70 million rows x 40 variables (columns): roughly 18 variables are categorical and the remaining 22 are numeric (including integers).

How should I go about this? Any suggestions?

What I have thought about doing:

a) Chunk the data out of the Hadoop system from CSV files in 1M-row increments and feed it to R (a rough sketch of what I mean is included below).

b) Some kind of batch processing.

It is not a real-time system, so it does not have to run every day, but I would still like to score the full set within about 2-3 hours.
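Roughly what I have in mind for option (a), as a sketch only (assuming the Hive table can be exported to CSV, the model is saved with saveRDS() and has a predict() method that returns class probabilities; every file and column name below is a placeholder):

# Sketch only: score a large CSV export in 1M-row chunks with a model
# fitted on the 1-2% sample. File and column names are placeholders.
score_in_chunks <- function(csv_path, out_path, chunk_size = 1e6) {
  model <- readRDS("model.rds")                 # classifier fitted on the sample
  con   <- file(csv_path, open = "r")
  on.exit(close(con))

  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  first  <- TRUE
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size, col.names = header),
      error = function(e) NULL)                 # read.csv errors out at EOF
    if (is.null(chunk) || nrow(chunk) == 0) break

    prob <- predict(model, newdata = chunk, type = "prob")   # assumed interface
    out  <- data.frame(primary_key = chunk$primary_key, score = prob[, 2])

    write.table(out, out_path, sep = ",", quote = FALSE, row.names = FALSE,
                col.names = first, append = !first)
    first <- FALSE
  }
}

With ~70 chunks of 1M rows this only holds one chunk in memory at a time; whether it fits the 2-3 hour window depends mostly on how fast predict() is for the chosen model.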

Answers

1

If you can install the R runtime on all of the data nodes, you can invoke your R code from a simple Hadoop Streaming map-only job.

You can also take a look at SparkR.
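For what it's worth, a rough sketch of what the mapper of such a map-only streaming job could look like in R (model.rds, header.txt, the comma-separated layout and the predict() call are all assumptions; the R runtime plus these two files must be available on every data node, e.g. shipped with -files):

#!/usr/bin/env Rscript
# Sketch of a Hadoop Streaming map-only mapper: read records from stdin,
# score them with a pre-fitted model, emit "key<TAB>score" on stdout.
model  <- readRDS("model.rds")                       # shipped with -files
header <- scan("header.txt", what = character(), sep = ",", quiet = TRUE)

con <- file("stdin", open = "r")
while (length(lines <- readLines(con, n = 10000)) > 0) {
  fields <- strsplit(lines, ",", fixed = TRUE)
  chunk  <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(chunk) <- header
  # NOTE: everything arrives as character over stdin; numeric/factor columns
  # must be converted here to match what the model was trained on.
  score <- predict(model, newdata = chunk, type = "prob")[, 2]   # assumed interface
  cat(sprintf("%s\t%s\n", chunk$primary_key, score), sep = "")
}
close(con)

The job itself would be launched with something along the lines of hadoop jar hadoop-streaming*.jar -D mapreduce.job.reduces=0 -files score_mapper.R,model.rds,header.txt -mapper score_mapper.R -input /path/in -output /path/out (the jar name and paths depend on your distribution).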

1

I gather that you want to run your R code (your classifier) over the full dataset, not just the sample dataset.

So we are looking at executing R code on a massively distributed system.

Moreover, it has to integrate tightly with the Hadoop components.

So RHadoop fits your problem statement well.

http://www.rdatamining.com/big-data/r-hadoop-setup-guide
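To make that concrete, a rough sketch of the scoring step with the rmr2 package from RHadoop (the HDFS paths, model.rds and the column layout are placeholders; you would also want to check that the fitted model is small enough to ship to the map tasks this way):

# Sketch only: score the full table with rmr2 (RHadoop). Paths, model.rds
# and the column layout below are placeholders, not part of rmr2.
library(rmr2)

model <- readRDS("model.rds")                                    # fitted on the 1-2% sample
cols  <- c("primary_key", paste0("n", 1:22), paste0("c", 1:18))  # 22 numeric + 18 categorical

mapreduce(
  input        = "/user/me/bigtable_csv",                   # CSV export of the Hive table
  output       = "/user/me/bigtable_scored",
  input.format = make.input.format("csv", sep = ","),
  map = function(k, v) {                                    # v arrives as a data frame chunk
    names(v) <- cols
    p <- predict(model, newdata = v, type = "prob")[, 2]    # assumed interface
    keyval(v$primary_key, p)                                # emit (key, score) pairs
  }
)

The scored (key, value) pairs land in HDFS under the output path and can be read back with from.dfs() if needed.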

+0

The classifier was built on the sample dataset - i.e. only about 1% of the data. But I will look into RHadoop. –

0
Scoring 80 million records in about 8.5 seconds

The code below was run on an off-lease Dell T7400 workstation (64 GB RAM, dual quad-core 3 GHz Xeons, and two RAID 0 SSD arrays on separate channels) that I purchased for $600. I also use the free SPDE engine to partition the dataset.

For a dataset in the size range you describe you might want to consider SAS or WPS.
The code below scores 80 million synthetic records (40 variables plus a key) in under 9 seconds.

In-memory R together with SAS/WPS makes a great combination. Many SAS users consider datasets of less than 1 TB to be small.

I ran 8 parallel processes under 64-bit SAS 9.4 on 64-bit Windows Pro; the measured elapsed time was about 8.5 seconds.

%let pgm=utl_score_spde; 
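* housekeeping: clean up in the SPDE library before rebuilding the test data;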

proc datasets library=spde;
delete littledata_spde;
run;quit;
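* SPDE library: data partitions are spread across three paths on separate drives, 4 GB per partition;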

libname spde spde 'd:/tmp' 
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g") 
    partsize=4g; 
; 
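* build 80 million test records: a primary key, 20 numerics (n1-n20) and 20 4-byte character variables (c1-c20);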

data spde.littledata_spde (compress=char drop=idx); 
    retain primary_key; 
    array num[20] n1-n20; 
    array chr[20] $4 c1-c20; 
    do primary_key=1 to 80000000; 
    do idx=31 to 50; 
     num[idx-30]=uniform(-1); 
     chr[idx-30]=repeat(byte(idx),40); 
    end; 
    output; 
    end; 
run;quit; 



%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -nosplash -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas)); 
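* &_s is the command line that SYSTASK uses below to spawn each child SAS session (paths are site-specific);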

* score it; 
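* write the scoring macro to a file on the autocall path so each child session can run it;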


data _null_;file "c:\oto\utl_scoreit.sas" lrecl=512;input;put _infile_;putlog _infile_; 
cards4; 
%macro utl_scoreit(beg=1,end=10000000); 

    libname spde spde 'd:/tmp' 
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g") 
    partsize=4g; 

    libname out "G:/wrk"; 

    data keyscore; 

    set spde.littledata_spde(firstobs=&beg obs=&end 
     keep= 
      primary_key 
      n1 
      n12 
      n3 
      n14 
      n5 
      n16 
      n7 
      n18 
      n9 
      n10 
      c18 
      c19 
      c12); 
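    * simple linear score used here as a stand-in for the real classifier;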
    score= (.1*n1 + 
      .1*n12 + 
      .1*n3 + 
      .1*n14 + 
      .1*n5 + 
      .1*n16 + 
      .1*n7 + 
      .1*n18 + 
      .1*n9 + 
      .1*n10 + 
      (c18='0000') + 
      (c19='0000') + 
      (c12='0000'))/3 ; 
    keep primary_key score; 
    run; 

%mend utl_scoreit; 
;;;; 
run;quit; 

%utl_scoreit; 


%let tym=%sysfunc(time()); 
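* launch 8 child SAS sessions in parallel, each scoring a 10-million-row slice, then wait for all of them;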
systask kill sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=1,end=10000000);) -log G:\wrk\sys101.log" taskname=sys101; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=10000001,end=20000000);) -log G:\wrk\sys102.log" taskname=sys102 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=20000001,end=30000000);) -log G:\wrk\sys103.log" taskname=sys103 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=30000001,end=40000000);) -log G:\wrk\sys104.log" taskname=sys104 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=40000001,end=50000000);) -log G:\wrk\sys105.log" taskname=sys105 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=50000001,end=60000000);) -log G:\wrk\sys106.log" taskname=sys106 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=60000001,end=70000000);) -log G:\wrk\sys107.log" taskname=sys107 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=70000001,end=80000000);) -log G:\wrk\sys108.log" taskname=sys108 ; 
waitfor _all_ sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108; 
systask list; 
%put %sysevalf(%sysfunc(time()) - &tym); 

8.56500005719863 

NOTE: AUTOEXEC processing completed. 

NOTE: Libref SPDE was successfully assigned as follows: 
     Engine:  SPDE 
     Physical Name: d:\tmp\ 
NOTE: Libref OUT was successfully assigned as follows: 
     Engine:  V9 
     Physical Name: G:\wrk 

NOTE: There were 10000000 observations read from the data set SPDE.LITTLEDATA_SPDE. 
NOTE: The data set WORK.KEYSCORE has 10000000 observations and 2 variables. 
NOTE: DATA statement used (Total process time): 
     real time   7.05 seconds 
     cpu time   6.98 seconds 



NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414 
NOTE: The SAS System used: 
     real time   8.34 seconds 
     cpu time   7.36 seconds 
+0

I can use SAS (my company is a big analytics shop), but how do I move a RandomForest or Naive Bayes model into the SAS ecosystem? –
