Different Python minimization functions give different values. Why?

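I am trying to learn Python by rewriting the assignments from Andrew Ng's machine learning course, which were originally in Octave (I took the class and earned the certificate). I have run into a problem with the optimization functions. The course uses fmincg, an Octave function, to minimize the cost function of linear regression (a convex function) given its derivative. It also teaches gradient descent and the normal equation, and if they are used correctly they should all, in theory, give the same result (to within a few decimal places). All of these work for linear regression, and I get the same results in Python.

To be clear, I am trying to minimize the cost function to find the best-fit parameters (theta) for a dataset. So far I have used 'nelder-mead', which does not need a derivative, and it has given me the closest solution. I have also tried 'TNC', 'CG' and 'BFGS', all of which need a derivative function in order to minimize the function. They all work fine with a first-order (linear) polynomial, but when I raise the order of the polynomial so that the model is non-linear (in my case x^1 through x^8), I cannot get my function to fit the dataset.

The exercise is very simple: I have 12 data points, so an 8th-order polynomial should capture every point (if you are curious, it is an example of high variance, i.e. overfitting the data). The solution they show is a curve that passes through all the data points, as expected. The best result I got was with 'nelder-mead', which captured only two points of the dataset, while the other minimization functions did not give me anything close to what I wanted. I do not know what is wrong, because my cost function and gradient give the correct values for the linear case (exactly the Octave answers), so I assume they are working correctly.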
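
For reference, the cost, gradient and normal equation that the listings in this question implement can be written out as below. The notation is an assumption made here for illustration (m examples, parameters theta_0 through theta_n with theta_0 as the unregularized bias, and L the identity matrix with its top-left entry set to zero); it is read off the code in the post rather than quoted from the course.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)\top} \theta - y^{(i)} \right)^2
          + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)\top} \theta - y^{(i)} \right) x^{(i)}_0

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)\top} \theta - y^{(i)} \right) x^{(i)}_j
          + \frac{\lambda}{m} \theta_j \qquad (j \ge 1)

\theta^{*} = \left( X^{\top} X + \lambda L \right)^{-1} X^{\top} y
```

Because J is a convex quadratic in theta, every method that actually reaches the minimum should agree with the closed-form solution up to its convergence tolerance.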

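I will list my functions in Octave and Python below, in the hope that someone can explain why I get different answers, or point out an obvious mistake that I am not seeing.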

function [J, grad] = linearRegCostFunction(X, y, theta, lambda) 
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear 
%regression with multiple variables 
% [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the 
% cost of using theta as the parameter for linear regression to fit the 
% data points in X and y. Returns the cost in J and the gradient in grad 


m = length(y); % number of training examples 
J = 0; 
grad = zeros(size(theta)); 

htheta = X * theta; 
n = length(theta); % number of parameters (size(theta) returns a vector, not a scalar)
J = 1/(2 * m) * sum((htheta - y) .^ 2) + lambda/(2 * m) * sum(theta(2:n) .^ 2); 

grad = 1/m * X' * (htheta - y); 
grad(2:n) = grad(2:n) + lambda/m * theta(2:n); % the bias term theta(1) is not regularized
grad = grad(:); 

end

Here are my Python code snippets; if anyone would like the full code, I can provide that as well:

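import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

def costFunction(theta, Xcost, y, lmda):
    # Regularized linear-regression cost; the bias theta[0] is not penalized.
    m = len(y)
    theta = theta.reshape((len(theta), 1))
    htheta = np.dot(Xcost, theta) - y  # residuals; y is assumed to have shape (m, 1)
    # 1.0 and 2.0 force float division (under Python 2, 1/(2*m) is integer division and yields 0)
    J = 1.0 / (2 * m) * np.dot(htheta.T, htheta) + lmda / (2.0 * m) * np.sum(theta[1:, :] ** 2)
    return J.item()  # return a plain scalar rather than a 1x1 array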

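def gradCostFunc(gradtheta, X, y, lmda):
    # Gradient of costFunction with respect to theta, returned as a flat (n,) array.
    m = float(len(y))  # float so that 1/m is not truncated to 0 under Python 2
    gradtheta = gradtheta.reshape((len(gradtheta), 1))
    hgradtheta = np.dot(X, gradtheta) - y  # residuals; y is assumed to have shape (m, 1)

    grad = (1 / m) * np.dot(X.T, hgradtheta)

    # add the regularization term to every component except the bias
    grad[1:, 0] = grad[1:, 0] + (lmda / m) * gradtheta[1:, 0]
    return grad.reshape((len(grad)))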

def normalEqn(X, y, lmda): 
    e = np.eye(X.shape[1]) 
    e[0,0] = 0 
    theta = np.dot(np.linalg.pinv(np.dot(X.T,X) + lmda * e),np.dot(X.T,y)) 
    return theta 

def gradientDescent(X, y, theta, alpha, lmda, num_iters):
    # Batch gradient descent with a fixed step size alpha.
    # Flatten theta so it matches the (n,) gradient; subtracting an (n,) array
    # from an (n, 1) array would broadcast to an (n, n) matrix.
    theta = np.asarray(theta, dtype=float).ravel()
    # J_history tracks the evolution of the cost function
    J_history = np.zeros((num_iters, 1))

    for i in range(num_iters):
        grad = gradCostFunc(theta, X, y, lmda)
        # update the thetas
        theta = theta - alpha * grad
        J_history[i] = costFunction(theta, X, y, lmda)

    plt.plot(J_history)
    plt.show()

    return theta

def trainLR(initheta, X, y, lmda):
    # Minimize costFunction with SciPy, supplying the analytic gradient.
    options = {'maxiter': 1000}
    # ravel(): minimize works on a 1-D parameter vector
    res = optimize.minimize(costFunction, initheta.ravel(), jac=gradCostFunc, method='CG', args=(X, y, lmda), options=options)
    #res = optimize.minimize(costFunction, initheta.ravel(), method='nelder-mead', args=(X, y, lmda), options={'disp': False})
    # note: fmin_bfgs returns the parameter vector itself, not a result object with .x
    #res = optimize.fmin_bfgs(costFunction, initheta.ravel(), fprime=gradCostFunc, args=(X, y, lmda))
    return res.x

def polyFeatures(X, degree):
    # map X to the polynomial features X, X**2, ..., X**degree (one column per power)
    out = X
    for i in range(2, degree + 1):
        out = np.column_stack((out, X ** i))
    return out

def featureNormalize(X): 
    # The polynomial features vary by orders of magnitude,
    # so normalize each column to zero mean and unit standard deviation.
    # Note: np.std divides by N by default (ddof=0).
    mu = np.mean(X, axis=0) 
    S1 = np.std(X, axis=0) 
    return mu, S1, (X - mu)/S1
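
One detail worth checking when comparing numbers against Octave: np.std divides by N by default, while Octave's std divides by N - 1. Whether this matters for the fit in question is not established in this thread; the sketch below only shows the size of the difference on a made-up column of 12 values (not the course data).

```python
import numpy as np

# Synthetic column of 12 values (illustrative only, not the course data).
x = np.arange(12, dtype=float)

std_population = np.std(x)        # NumPy default: divide by N (ddof=0)
std_sample = np.std(x, ddof=1)    # divide by N - 1, which is Octave's std() default

# With 12 points the two differ by a factor of sqrt(12/11), about 4.4%.
ratio = std_sample / std_population
print(std_population, std_sample, ratio)
```

With lmda = 0 this rescaling does not change the fitted curve, but it does change the individual theta values, so a parameter-by-parameter comparison with Octave can look wrong even when the fit is the same. Passing ddof=1 in featureNormalize would make the two agree.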

Here are the main calls to these functions:

X, y, Xval, yval, Xtest, ytest = loadData('ex5data1.mat') 
X_poly = X # used later in the program
p = 8 
X_poly = polyFeatures(X_poly, p) 
mu, sigma, X_poly = featureNormalize(X_poly) 
X_poly = padding(X_poly) 
theta = np.zeros((X_poly.shape[1],1)) 
theta = trainLR(theta, X_poly, y, 0.) 
#theta = normalEqn(X_poly, y, 0.) 
#theta = gradientDescent(X_poly, y, theta, 0.1, 0, 1500)
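
One way to see how far each optimizer actually gets is to compare its final cost with a direct least-squares solution, which is the global minimum when lmda = 0. The sketch below is only an illustration under stated assumptions: the 12 data points are synthetic (not ex5data1.mat), and cost, grad and the variable names are simplified stand-ins for the functions above. It also prints the condition number of the design matrix, since a badly conditioned problem is one plausible reason for iterative methods to stop at different points.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y):
    # Unregularized least-squares cost (the lmda = 0 case)
    r = X @ theta - y
    return r @ r / (2.0 * len(y))

def grad(theta, X, y):
    # Analytic gradient of cost
    return X.T @ (X @ theta - y) / len(y)

# Synthetic stand-in for the exercise data: 12 noisy points (assumption)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(12)

# Powers x^1..x^8, normalized per column, then a column of ones for the bias
P = np.column_stack([x ** k for k in range(1, 9)])
P = (P - P.mean(axis=0)) / P.std(axis=0, ddof=1)
X = np.column_stack([np.ones(len(x)), P])

# Reference: direct least-squares solution and its cost
theta_ref = np.linalg.lstsq(X, y, rcond=None)[0]
cost_ref = cost(theta_ref, X, y)
cond_number = np.linalg.cond(X)
print("condition number of X:", cond_number)

# Run several gradient-based methods from the same starting point
costs = {}
for method in ("CG", "BFGS", "L-BFGS-B"):
    res = optimize.minimize(cost, np.zeros(X.shape[1]), jac=grad,
                            args=(X, y), method=method,
                            options={"maxiter": 1000})
    costs[method] = res.fun
    # Gap to the reference cost shows how close each method got
    print(method, res.success, res.nit, res.fun - cost_ref)
```

If the gap to cost_ref is large for a method, or res.success is False, that method stopped before reaching the minimum, which would explain different answers from different methods even when the cost and gradient are correct.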

Source

2013-12-20 Henry80s

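Why not compare your results at each step against the correct results from Octave? You can print the intermediate results of your cost function and gradient function. – lennon310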
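
A minimal sketch of that suggestion, assuming SciPy's minimize is used: its callback argument is called with the current parameter vector after each iteration, so the cost and gradient norm can be logged and compared with Octave's output. The data here is a made-up straight line, and cost, grad and record are hypothetical names for this illustration.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y):
    r = X @ theta - y
    return r @ r / (2.0 * len(y))

def grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

# Synthetic straight-line data (assumption): y = 2 + 3x exactly
x = np.linspace(0.0, 1.0, 12)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x

history = []

def record(theta_k):
    # Called by minimize after each iteration with the current theta
    history.append((cost(theta_k, X, y),
                    np.linalg.norm(grad(theta_k, X, y))))

res = optimize.minimize(cost, np.zeros(2), jac=grad, args=(X, y),
                        method="CG", callback=record)

for k, (J_k, g_k) in enumerate(history, start=1):
    print(k, J_k, g_k)
```

The same trace printed from fmincg in Octave would show at which iteration the two implementations start to diverge.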

My answer may be beside the point, since your question is asking for help debugging your current implementation.

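That said, if you are interested in using off-the-shelf optimizers in Python, take a look at OpenOpt. The library contains fairly high-performance implementations of optimizers for a variety of optimization problems.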

I should also mention that the scikit-learn library provides a nice machine learning toolkit for Python.
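
As a rough sketch of what that could look like for this exercise (an illustration with synthetic data, not the asker's dataset), a scikit-learn pipeline can build the polynomial features, normalize them, and fit the unregularized model in one step:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the 12 training points (assumption)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12).reshape(-1, 1)
y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(12)

model = make_pipeline(
    PolynomialFeatures(degree=8, include_bias=False),  # x^1 .. x^8
    StandardScaler(),                                  # zero mean, unit variance
    LinearRegression(),                                # fits the intercept itself
)
model.fit(x, y)
y_hat = model.predict(x)
r2 = model.score(x, y)
print(r2)
```

LinearRegression corresponds to the lmda = 0 case. For a non-zero lmda, Ridge(alpha=lmda) should play the same role, since it also leaves the intercept unpenalized, but treat that mapping as something to verify against your own cost function.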

Source

2013-12-20 21:25:43 Rob

It would be great if you could help find the bug, but what I am asking here is why I get different answers from different functions. – Henry80s

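@Henry80s: I think I misunderstood your original question. Are you (a) getting different answers between Octave and Python, or (b) getting unexpected results for the higher-order polynomial fit? – Rob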

All I am asking here is why I get different answers from different functions. Also, do my costFunction and gradCostFunction look correct? – Henry80s
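
One way to answer the second part without relying on the linear case alone is a finite-difference gradient check. The sketch below uses scipy.optimize.check_grad on random data with a non-zero lmda; cost and grad are simplified stand-ins (flat theta, 1-D y) written for this illustration, not the exact functions from the question.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y, lmda):
    # Regularized cost; the bias theta[0] is not penalized
    m = len(y)
    r = X @ theta - y
    return (r @ r + lmda * np.sum(theta[1:] ** 2)) / (2.0 * m)

def grad(theta, X, y, lmda):
    # Analytic gradient of cost
    m = len(y)
    g = X.T @ (X @ theta - y) / m
    g[1:] += lmda / m * theta[1:]
    return g

# Random test problem (assumption): 12 examples, bias column plus 3 features
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(12), rng.standard_normal((12, 3))])
y = rng.standard_normal(12)
theta0 = rng.standard_normal(4)

# check_grad returns the norm of (analytic gradient - finite-difference gradient)
err = optimize.check_grad(cost, grad, theta0, X, y, 1.0)
print(err)
```

A value close to zero (around 1e-6 or smaller) suggests that the gradient matches the cost; a large value points to a bug in one of the two functions.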
