2015-08-15 101 views
6

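I want to use a genetic algorithm (GA) in R to optimize the three parameters (gamma, cost and epsilon) of eps-regression (SVR). Below is what I have done so far. How can I use a genetic algorithm to optimize these parameters?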

library(e1071) 
data(Ozone, package="mlbench") 
a<-na.omit(Ozone) 
index<-sample(1:nrow(a), trunc(nrow(a)/3)) 
trainset<-a[index,] 
testset<-a[-index,] 
model<-svm(V4 ~ .,data=trainset, cost=0.1, gamma=0.1, epsilon=0.1, type="eps-regression", kernel="radial") 
error<-model$residuals 
rmse <- function(error) # root mean squared error 
{ 
    sqrt(mean(error^2)) 
} 
rmse(error) 

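Here I set cost, gamma and epsilon to 0.1 each, but I do not think these are the best values. So I want to use a genetic algorithm to optimize these parameters.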

GA <- ga(type = "real-valued", fitness = rmse, 
     min = c(0.1,3), max = c(0.1,3), 
     popSize = 50, maxiter = 100) 

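Here I used RMSE as the fitness function. But I think the fitness function must contain the parameters to be optimized. In SVR, however, the objective function is too complicated to write out in R code; I have been trying to find one for a long time, to no avail. If anyone knows both SVR and GA and has experience optimizing SVR parameters with a GA, please help me.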

Answer

12

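In such an application, one passes the parameters whose values are to be optimized (in your case, cost, gamma and epsilon) as the arguments of the fitness function. The fitness function then runs a model fitting + evaluation function and uses a measure of the model's performance as the measure of fitness. Therefore, the explicit form of the objective function is not directly relevant.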
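The pattern can be tried on a toy problem before plugging in an SVM. The following is only a minimal sketch: `toy_loss`, `toy_fitness` and `toy_run` are hypothetical names, and it assumes a recent release of the GA package in which the search bounds are called `lower` and `upper` (older releases used `min` and `max`). Because `ga()` maximizes the fitness, the sketch returns the negated loss.

```r
library(GA)

# Hypothetical toy loss with a known minimum at (1, -2).
toy_loss <- function(x) (x[1] - 1)^2 + (x[2] + 2)^2

# The parameter vector x is the argument of the fitness function.
# ga() maximizes, so the fitness is the negated loss.
toy_fitness <- function(x) -toy_loss(x)

toy_run <- ga(type = "real-valued", fitness = toy_fitness,
              lower = c(-5, -5), upper = c(5, 5),
              popSize = 30, maxiter = 50,
              monitor = FALSE, seed = 1)

toy_run@solution      # should lie close to (1, -2)
toy_run@fitnessValue  # should be close to 0, the best attainable fitness
```

Replacing `toy_loss` with a function that fits a model and returns its error gives the same structure as the SVR case.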

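In the implementation below, I use 5-fold cross-validation to estimate the RMSE for a given set of parameters. In particular, since package GA maximizes the fitness function, I have written the fitness value for a given value of the parameters as minus the average rmse over the cross-validation datasets. Hence, the maximum fitness that can be attained is zero.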

Here it is:

library(e1071) 
library(GA) 

data(Ozone, package="mlbench") 
Data <- na.omit(Ozone) 

# Setup the data for cross-validation 
K = 5 # 5-fold cross-validation 
fold_inds <- sample(1:K, nrow(Data), replace = TRUE) 
lst_CV_data <- lapply(1:K, function(i) list(
    train_data = Data[fold_inds != i, , drop = FALSE], 
    test_data = Data[fold_inds == i, , drop = FALSE])) 

# Given the values of parameters 'cost', 'gamma' and 'epsilon', return the rmse of the model over the test data 
evalParams <- function(train_data, test_data, cost, gamma, epsilon) { 
    # Train 
    model <- svm(V4 ~ ., data = train_data, cost = cost, gamma = gamma, epsilon = epsilon, type = "eps-regression", kernel = "radial") 
    # Test: RMSE on the held-out fold (square root of the mean squared error) 
    rmse <- sqrt(mean((predict(model, newdata = test_data) - test_data$V4)^2)) 
    return (rmse) 
} 

# Fitness function (to be maximized) 
# Parameter vector x is: (cost, gamma, epsilon) 
fitnessFunc <- function(x, Lst_CV_Data) { 
    # Retrieve the SVM parameters 
    cost_val <- x[1] 
    gamma_val <- x[2] 
    epsilon_val <- x[3] 

    # Use cross-validation to estimate the RMSE for each split of the dataset 
    rmse_vals <- sapply(Lst_CV_Data, function(in_data) with(in_data, 
     evalParams(train_data, test_data, cost_val, gamma_val, epsilon_val))) 

    # As fitness measure, return minus the average rmse (over the cross-validation folds), 
    # so that by maximizing fitness we are minimizing the rmse 
    return (-mean(rmse_vals)) 
} 

# Range of the parameter values to be tested 
# Parameters are: (cost, gamma, epsilon) 
theta_min <- c(cost = 1e-4, gamma = 1e-3, epsilon = 1e-2) 
theta_max <- c(cost = 10, gamma = 2, epsilon = 2) 

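# Run the genetic algorithm 
# The extra named argument Lst_CV_Data is passed through ga()'s '...' to fitnessFunc 
# (recent GA releases call the bounds 'lower' and 'upper'; 'min' and 'max' are deprecated there) 
results <- ga(type = "real-valued", fitness = fitnessFunc, Lst_CV_Data = lst_CV_data, 
    names = names(theta_min), 
    min = theta_min, max = theta_max, 
    popSize = 50, maxiter = 10) 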

summary(results) 

This produced the following results (for the range of parameter values I specified, which may need fine-tuning based on the data):

GA results: 
Iterations    = 100 
Fitness function value = -14.66315 
Solution    = 
     cost  gamma epsilon 
[1,] 2.643109 0.07910103 0.09864132 
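
Once the GA has returned a solution, the values can be plugged back into `svm()`. The following is only a sketch: the parameter values are copied from the output above (another run will give different ones), and the 80/20 hold-out split and the names `best`, `final_model` and `holdout_rmse` are hypothetical, chosen for illustration. Since the hold-out rows here were also part of the data used for tuning, the resulting RMSE is likely optimistic.

```r
library(e1071)

data(Ozone, package = "mlbench")
Data <- na.omit(Ozone)

# Parameter values copied from the GA output above.
best <- c(cost = 2.643109, gamma = 0.07910103, epsilon = 0.09864132)

# Hypothetical 80/20 hold-out split, used only for this illustration.
set.seed(1)
test_idx   <- sample(nrow(Data), size = floor(0.2 * nrow(Data)))
train_data <- Data[-test_idx, , drop = FALSE]
test_data  <- Data[test_idx, , drop = FALSE]

# Refit the SVR with the selected parameters.
final_model <- svm(V4 ~ ., data = train_data,
                   cost = best[["cost"]], gamma = best[["gamma"]],
                   epsilon = best[["epsilon"]],
                   type = "eps-regression", kernel = "radial")

# RMSE on the hold-out rows.
pred <- predict(final_model, newdata = test_data)
holdout_rmse <- sqrt(mean((pred - test_data$V4)^2))
holdout_rmse
```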
+0

I can't tell you how much this means to me... Thank you so much! – jihoon

+0

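Thank you very much! The code works for the Ozone data. However, if I delete some rows from the Ozone data, or if I change the numbers in a particular column, it does not work and gives the error "Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!". How can I fix this? –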