2012-10-01

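Fast computation of Tomek links in R

I want to implement Tomek links to deal with imbalanced data. The code is for a binary classification problem in which class 0 is the majority class and class 1 is the minority class. X is the input and Y is the output. I wrote the code below, but I am looking for a way to speed up the computation.

How can I improve my code?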

######################### 
#remove overlapping observations using Tomek links 
#given observations i and j belonging to different classes 
#(i,j) is a Tomek link if there is NO example z, such that d(i, z) < d(i, j) or d(j , z) < d(i, j) 
#find Tomek links and remove only the observations of each link that belong to the majority class (0 class). 
######################### 
tomekLink<-function(X,Y,distType="euclidean"){ 
i.1<-which(Y==1) 
i.0<-which(Y==0) 
X.1<-X[i.1,] 
X.0<-X[i.0,] 
i.tomekLink=NULL 
j.tomekLink=NULL 
#i and j belong to different classes 
timeTomek<-system.time({ 
for(i in i.1){ 
    for(j in i.0){ 
     d<-dst(X,i,j,distType) 
     obsleft<-setdiff(1:nrow(X),c(i,j)) 
     for(z in obsleft){ 
      if (dst(X,i,z,distType)<d || dst(X,j,z,distType)<d){ 
       break #(i,j) is not a Tomek link, get next pair (i,j) 
       } 
      #if z is the last obs and d(i, z) > d(i, j) and d(j , z) > d(i, j),then (i,j) is a Tomek link 
      if(z==obsleft[length(obsleft)]){ 
       if (dst(X,i,z,distType)>d && dst(X,j,z,distType)>d){ 
        #(i,j) is a Tomek link 
        #cat("\n tomeklink obs",i,"and",j) 
        i.tomekLink=c(i.tomekLink,i) 
        j.tomekLink=c(j.tomekLink,j) 
        #since we want to eliminate only majority class observations 
        #remove j from i.0 to speed up the loop 
        i.0<-setdiff(i.0,j) 
        } 
       } 
      } 
     } 
    } 
}) 
print(paste("Time to find Tomek links:",round(timeTomek[3],digits=2))) 
#id2keep<-setdiff(1:nrow(X),c(i.tomekLink,j.tomekLink)) 
id2keep<-setdiff(1:nrow(X),j.tomekLink) 
cat("number of obs removed using Tomek links",nrow(X)-length(id2keep),"\n", 
    (nrow(X)-length(id2keep))/nrow(X)*100,"% of training ;", 
    (length(j.tomekLink))/length(which(Y==0))*100,"% of 0 class") 
X<-X[id2keep,] 
Y<-Y[id2keep] 
cat("\n prop of 1 after TomekLink:",(length(which(Y==1))/length(Y))*100,"% \n") 
return(list(X=X,Y=Y)) 
} 


#distance measure used in tomekLink function 
dst<-function(X,i,j,distType="euclidean"){ 
#as.numeric() drops the "dist" class so callers compare plain numbers 
d<-as.numeric(dist(rbind(X[i,],X[j,]), method= distType)) 
return(d) 
} 

Answer

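I have not tested your code, but at first glance it looks like preallocation would help. Instead of growing the vector with i.tomekLink = c(i.tomekLink, i), which copies it on every append, try allocating the memory for storing the Tomek links up front.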
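A minimal sketch of the preallocation idea, not taken from the original code: `collectIndices` and its `isHit` argument are hypothetical names used only for illustration, with `isHit` standing in for the "is this a Tomek link?" test.

```r
# Hypothetical helper: gather the candidates that pass a test without
# growing the result vector with c() on every hit.
collectIndices <- function(candidates, isHit) {
  out <- integer(length(candidates))   # allocate once, at the largest possible size
  n.found <- 0L
  for (k in candidates) {
    if (isHit(k)) {
      n.found <- n.found + 1L
      out[n.found] <- k                # write into the next free slot
    }
  }
  out[seq_len(n.found)]                # trim the unused slots
}

# Toy usage: keep the multiples of 3 among 1..1000
hits <- collectIndices(1:1000, function(k) k %% 3L == 0L)
```

In the question's function the same pattern would apply to `j.tomekLink`, whose length can never exceed the number of class 0 observations.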

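Another idea is to compute the distance matrix from all samples to all samples and look at the nearest neighbor of each sample. If the nearest neighbor belongs to a different class, and the relation is mutual (each point is the other's nearest neighbor), then you have a Tomek link.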
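The distance-matrix idea could be sketched roughly as follows. This is an illustration under stated assumptions rather than a drop-in replacement: `tomekLinkFast` is a hypothetical name, X is assumed to be a numeric matrix or data frame without missing values, Y a 0/1 vector with 0 as the majority class, and the full n x n matrix is assumed to fit in memory. Ties are resolved by `which.min` (first minimum wins), which differs slightly from the strict inequalities in the question's loop.

```r
# Sketch: Tomek links as mutual nearest neighbours of different classes.
tomekLinkFast <- function(X, Y, distType = "euclidean") {
  D <- as.matrix(dist(X, method = distType))   # full n x n distance matrix
  diag(D) <- Inf                               # a point is not its own neighbour
  nn <- apply(D, 1, which.min)                 # nearest neighbour of every row
  idx <- seq_len(nrow(D))
  mutual <- nn[nn] == idx                      # i is also the nearest neighbour of nn[i]
  isLink <- mutual & (Y != Y[nn])              # ... and the two classes differ
  toRemove <- idx[isLink & Y == 0]             # majority-class end of each link
  keep <- setdiff(idx, toRemove)
  list(X = X[keep, , drop = FALSE], Y = Y[keep], removed = toRemove)
}

# Toy usage: points 1 and 2 are mutual nearest neighbours of different classes
X.toy <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.1, 5), c(9, 9), c(9.2, 9))
Y.toy <- c(1, 0, 0, 0, 1, 1)
res <- tomekLinkFast(X.toy, Y.toy)
```

This replaces the triple loop with one call to `dist` and a few vectorised comparisons, at the cost of O(n^2) memory for the distance matrix.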