
Using the KDD Cup 99 dataset and machine learning with R

I want to use the KDD Cup 99 dataset with R, but unfortunately I am getting very bad results: the predictor is basically guessing (about 50% error on the cross-validation set). There is probably a mistake somewhere in my code, but I cannot find where.

The KDD Cup 99 dataset consists of about 4 million examples, divided into four different classes of attack plus the "normal" class. First, I split the dataset into five files (one per attack class, plus one for the "normal" class) and converted the non-numeric features to numeric ones. At the moment I am working on the "remote to local" (r2l) class. I selected some features based on the results of a paper on the subject. After that, I drew a sample of "normal" instances equal in size to the number of r2l instances, to avoid class-imbalance problems. I also replaced the labels of all the different types of r2l attack with the single label "attack", so that I could train a two-class classifier. I then merged that sample with the r2l instances into a new dataset. Finally, I applied 10-fold cross-validation to assess my model, which was built with an SVM, and I got the worst results in the history of machine learning... :(
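A minimal sketch of that kind of non-numeric-to-numeric conversion (this is not the actual preprocessing script, which is not shown; the file name is illustrative, and columns 2-4 are the three categorical features of the raw KDD Cup 99 format: protocol_type, service and flag):

# Sketch only: read the raw comma-separated KDD Cup 99 file (no header)
# and replace each categorical column by integer level codes.
raw <- read.table("kddcup.data", sep=",", header=F)
for (j in c(2,3,4)) {          # protocol_type, service, flag
    raw[,j] <- as.integer(factor(raw[,j]))
}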

Here is my code:

library(e1071)  # provides svm() used below

r2l <- read.table("kddcup_r2l.data",sep=",",header=T) 
#u2r <- read.table("kddcup_u2r.data",sep=",",header=T) 
#probe_original <- read.table("kddcup_probe.data",sep=",",header=T) 
#dos <- read.table("kddcup_dos.data",sep=",",header=T) 
normal <- read.table("kddcup_normal.data",sep=",",header=T) 

#probe <- probe_original[sample(1:dim(probe_original)[1],10000),] 

# Features selected by the three algorithms svm, lgp and mars 
# for the different classes of attack 
######################################################################## 

features.r2l.svm <- c("srv_count","service","duration","count","dst_host_count") 
features.r2l.lgp <- c("is_guest_login","num_access_files","dst_bytes","num_failed_logins","logged_in") 
features.r2l.mars <- c("srv_count","service","dst_host_srv_count","count","logged_in") 
features.r2l.combined <- unique(c(features.r2l.svm,features.r2l.lgp,features.r2l.mars)) 



#  Sample the training set containing the normal labels 
#  for each class of attack in order to have the same number 
#  of training data belonging to the "normal" class and the 
#  "attack" class 
####################################################################### 

normal_sample.r2l <- normal[sample(1:dim(normal)[1],dim(r2l)[1]),] 


# This part was useful before the separation normal/attack because 
# attack was composed of different types for each class 
###################################################################### 

normal.r2l.Y <- matrix(normal_sample.r2l[,c("label")]) 


####################################################################### 
#  Class of attack Remote to Local (r2l) 
####################################################################### 

# Select the features according to the algorithms(svm,lgp and mars) 
# for this particular type of attack. Combined contains the 
# combination of the features selected by the 3 algorithms 
####################################################################### 
#features.r2l.svm <- c(features.r2l.svm,"label") 
r2l_svm <- r2l[,features.r2l.svm] 
r2l_lgp <- r2l[,features.r2l.lgp] 
r2l_mars <- r2l[,features.r2l.mars] 
r2l_combined <- r2l[,features.r2l.combined] 
r2l_ALL <- r2l[,colnames(r2l) != "label"] 

r2l.Y <- matrix(r2l[,c("label")]) 
r2l.Y[,1] = "attack" 



# Merge the "normal" instances and the "r2l" instances and shuffle the result 
############################################################################### 

r2l_svm.tr <- rbind(normal_sample.r2l[,features.r2l.svm],r2l_svm) 
r2l_svm.tr <- r2l_svm.tr[sample(1:nrow(r2l_svm.tr),replace=F),] 
r2l_lgp.tr <- rbind(normal_sample.r2l[,features.r2l.lgp],r2l_lgp) 
r2l_lgp.tr <- r2l_lgp.tr[sample(1:nrow(r2l_lgp.tr),replace=F),] 
r2l_mars.tr <- rbind(normal_sample.r2l[,features.r2l.mars],r2l_mars) 
r2l_mars.tr <- r2l_mars.tr[sample(1:nrow(r2l_mars.tr),replace=F),] 
r2l_ALL.tr <- rbind(normal_sample.r2l[,colnames(normal_sample.r2l) != "label"],r2l_ALL) 
r2l_ALL.tr <- r2l_ALL.tr[sample(1:nrow(r2l_ALL.tr),replace=F),] 

r2l.Y.tr <- rbind(normal.r2l.Y,r2l.Y) 
r2l.Y.tr <- matrix(r2l.Y.tr[sample(1:nrow(r2l.Y.tr),replace=F),]) 

####################################################################### 
# 
#  10-fold CROSS-VALIDATION to assess the models accuracy 
# 
####################################################################### 

# CV for Remote to Local 
########################  
cv(r2l_svm.tr, r2l_lgp.tr, r2l_mars.tr, r2l_ALL.tr, r2l.Y.tr) 

And the cross-validation function:

cv <- function(svm.tr, lgp.tr, mars.tr, ALL.tr, Y.tr) { 

    Jcv.svm_mean <- NULL 

    # Compute the size of each cross-validation fold 
    # ============================================== 
    index <- sample(1:dim(svm.tr)[1]) 
    size.CV <- floor(dim(svm.tr)[1]/10) 

    Jcv.svm <- NULL 

    # Start 10-fold cross-validation 
    # ============================== 
    for (i in 1:10) { 
        # If m is the size of the training set 
        # (number of rows in svm.tr, for example), 
        # take n observations for test and (m-n) for training, 
        # with n << m (here n = m/10) 
        # ==================================================== 
        i.ts <- (((i-1)*size.CV+1):(i*size.CV)) 
        i.tr <- setdiff(index, i.ts) 

        Y.tr.tr <- as.factor(Y.tr[i.tr]) 
        Y.tr.ts <- as.factor(matrix(Y.tr[i.ts], ncol=1)) 

        svm.tr.tr <- svm.tr[i.tr,] 
        svm.tr.ts <- svm.tr[i.ts,] 

        # Fit the model 
        # ============= 
        model.svm <- svm(Y.tr.tr~., svm.tr.tr, type="C-classification") 

        # Compute the predictions 
        # ======================= 
        Y.hat.ts.svm <- predict(model.svm, svm.tr.ts) 

        # Compute the error 
        # ================= 
        h.svm <- matrix(Y.hat.ts.svm, ncol=1) 

        Jcv.svm <- c(Jcv.svm, sum(!(h.svm == Y.tr.ts))/size.CV) 
        print(table(h.svm, Y.tr.ts)) 
    } 

    Jcv.svm_mean <- c(Jcv.svm_mean, mean(Jcv.svm)) 

    d <- 10 
    print(paste("Jcv.svm_mean: ", round(Jcv.svm_mean, digits=d))) 
} 

I am getting very strange results: it seems the algorithm does not really see any difference between the instances, and to me it looks more like guessing than predicting. I also tried the attack class "probe" and obtained the same results. The paper I mentioned earlier reports 30% on the r2l class and 60-98% on probe (depending on the polynomial degree).

Here is one of the predictions from the 10-fold cross-validation:

h.svm = attack,  Y.tr.ts = attack  -> 42 instances

h.svm = attack,  Y.tr.ts = normal. -> 44 instances

h.svm = normal., Y.tr.ts = attack  -> 71 instances

h.svm = normal., Y.tr.ts = normal. -> 68 instances
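One thing I am wondering about is the shuffling step: each *.tr data frame and r2l.Y.tr are shuffled by separate calls to sample(), so row i of r2l_svm.tr probably no longer corresponds to row i of r2l.Y.tr. If that is the problem, a single shared permutation applied everywhere (in place of the separate sample() calls above) would keep features and labels aligned, along these lines:

# Sketch: draw one permutation and reuse it for every data frame,
# so the feature rows and the labels stay aligned after shuffling.
perm <- sample(1:nrow(r2l_svm.tr))
r2l_svm.tr <- r2l_svm.tr[perm,]
r2l_lgp.tr <- r2l_lgp.tr[perm,]
r2l_mars.tr <- r2l_mars.tr[perm,]
r2l_ALL.tr <- r2l_ALL.tr[perm,]
r2l.Y.tr <- matrix(r2l.Y.tr[perm,])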

If anyone could tell me what is wrong with my code, I would be very grateful.

Thanks in advance.

Apparently nobody seems to answer... Is it because my question is not well formed? Or because nobody sees what is wrong? – Alex

This belongs on [DataScience.SE](http://datascience.stackexchange.com), but it is now too old to migrate. I recommend you try it there. – smci
