Coursera ML - Does the choice of optimization algo

Posted 2019-09-15 16:58

Question:

I recently completed exercise 3 of Andrew Ng's Machine Learning on Coursera using Python.

When initially completing parts 1.4 to 1.4.1 of the exercise, I had trouble getting my trained model to match the expected accuracy of 94.9%. Even after debugging and checking that my cost and gradient functions were bug-free and that my predictor code was working correctly, I was still getting only 90.3% accuracy. I was using the conjugate gradient (CG) algorithm in scipy.optimize.minimize.

Out of curiosity, I decided to try another algorithm, Broyden–Fletcher–Goldfarb–Shanno (BFGS). To my surprise, the accuracy improved drastically to 96.5%, exceeding the expected value. The comparison of the CG and BFGS results can be viewed in my notebook under the header "Difference in accuracy due to different optimization algorithms".

Is this difference in accuracy caused by the choice of optimization algorithm? If so, could someone explain why?

I would also greatly appreciate a review of my code, to make sure that the difference is not caused by a bug in one of my functions. I suspect that could be the case as well.

Thank you.

EDIT: I have added the code involved in the question below, so that anyone can help without having to refer to my Jupyter notebook.

Model cost functions:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    # Regularization term; theta[0] (the bias) is not regularized
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / len(y) + reg

def compute_gradient_regularized(theta, X, y, lda):
    beta = sigmoid(X @ theta) - y
    regterm = lda / len(y) * theta
    # theta_0 does not get regularized, so a 0 is substituted in its place
    regterm[0] = 0
    return X.T @ beta / len(y) + regterm
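
Since the question hinges on the cost and gradient functions being bug-free, a numerical gradient check is one way to test that. The following is a minimal sketch, assuming small synthetic data: `X_demo`, `y_demo` and `theta_demo` are made up for illustration and are not the exercise data. It uses `scipy.optimize.check_grad` to compare the analytic gradient with a finite-difference estimate.

```python
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / len(y) + reg

def compute_gradient_regularized(theta, X, y, lda):
    regterm = lda / len(y) * theta
    regterm[0] = 0  # the bias term is not regularized
    return X.T @ (sigmoid(X @ theta) - y) / len(y) + regterm

# Hypothetical synthetic data: 50 samples, a bias column plus 4 features
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 4))])
y_demo = (rng.random(50) > 0.5).astype(float)
theta_demo = rng.normal(scale=0.1, size=5)

# 2-norm of (analytic gradient - finite-difference gradient) at theta_demo
grad_error = check_grad(compute_cost_regularized,
                        compute_gradient_regularized,
                        theta_demo, X_demo, y_demo, 0.1)
print(grad_error)
```

A value very close to zero would suggest that the gradient is consistent with the cost function, at least at the point tested.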

Function that implements one-vs-all classification training:

from scipy.optimize import minimize

def train_one_vs_all(X, y, opt_method):
    theta_all = np.zeros((y.max()-y.min()+1, X.shape[1]))
    for k in range(y.min(),y.max()+1):
        grdtruth = np.where(y==k, 1,0)
        results = minimize(compute_cost_regularized, theta_all[k-1,:], 
                           args = (X,grdtruth,0.1),
                           method = opt_method, 
                           jac = compute_gradient_regularized)
        # optimized parameters are accessible through the x attribute
        theta_optimized = results.x
        # Assign the theta_optimized vector to the appropriate row in the
        # theta_all matrix
        theta_all[k-1,:] = theta_optimized
    return theta_all
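
To see whether two methods actually stop at the same minimum, the `OptimizeResult` returned by `minimize` can be inspected instead of keeping only its `x` attribute. The following is a minimal sketch, assuming synthetic binary data: `X_demo`, `y_demo` and `true_theta` are hypothetical and are not the exercise data. It records `success`, `nit` and `fun` for each method.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / len(y) + reg

def compute_gradient_regularized(theta, X, y, lda):
    regterm = lda / len(y) * theta
    regterm[0] = 0  # the bias term is not regularized
    return X.T @ (sigmoid(X @ theta) - y) / len(y) + regterm

# Hypothetical synthetic data: noisy labels drawn from a logistic model
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 5))])
true_theta = rng.normal(size=6)
y_demo = (sigmoid(X_demo @ true_theta) > rng.random(200)).astype(float)

reports = {}
for method in ("CG", "BFGS"):
    res = minimize(compute_cost_regularized, np.zeros(6),
                   args=(X_demo, y_demo, 0.1),
                   method=method,
                   jac=compute_gradient_regularized)
    # success: convergence flag, nit: iterations, fun: final cost
    reports[method] = (res.success, res.nit, res.fun)
    print(method, res.success, res.nit, res.fun, res.message)
```

If, on the real data, one method reported `success=False` or a noticeably higher final cost for some classes, that would point to the optimizer stopping early rather than to a difference in the model itself.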

I called the function to train the model with each optimization method:

theta_all_optimized_cg = train_one_vs_all(X_bias, y, 'CG')      # conjugate gradient
theta_all_optimized_bfgs = train_one_vs_all(X_bias, y, 'BFGS')  # Broyden–Fletcher–Goldfarb–Shanno

The prediction accuracy differs depending on the algorithm used:

def predict_one_vs_all(X, theta):
    # Returns the accuracy (%) against the global label vector y; labels start at 1
    return np.mean(np.argmax(sigmoid(X@theta.T), axis=1)+1 == y)*100

In[16]: predict_one_vs_all(X_bias, theta_all_optimized_cg)
Out[16]: 90.319999999999993

In[17]: predict_one_vs_all(X_bias, theta_all_optimized_bfgs)
Out[17]: 96.480000000000004

Anyone who wants the data to try the code can find it in my GitHub repository, as linked in this post.