I recently completed exercise 3 of Andrew Ng's Machine Learning on Coursera using Python.
While completing parts 1.4 to 1.4.1 of the exercise, I had trouble getting my trained model to match the expected accuracy of 94.9%. Even after debugging and confirming that my cost and gradient functions were bug free and that my predictor code was working correctly, I was still getting only 90.3% accuracy. I was using the conjugate gradient (CG) algorithm in scipy.optimize.minimize.
Out of curiosity, I decided to try another algorithm and used Broyden–Fletcher–Goldfarb–Shanno (BFGS). To my surprise, the accuracy improved drastically to 96.5%, exceeding the expected value. The comparison of the CG and BFGS results can be viewed in my notebook under the header Difference in accuracy due to different optimization algorithms.
Is this difference in accuracy due to the choice of optimization algorithm? If so, could someone explain why?
I would also greatly appreciate a review of my code, since I suspect there could still be a bug in one of my functions that is responsible for this.
Thank you.
EDIT: I have added the code involved in the question below, so that anyone can help me without referring to my Jupyter notebook.
Model cost functions:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
    return 1/len(y) * np.sum(-y @ np.log(sigmoid(X@theta))
                             - (1-y) @ np.log(1-sigmoid(X@theta))) + reg

def compute_gradient_regularized(theta, X, y, lda):
    gradient = np.zeros(len(theta))
    XT = X.T
    beta = sigmoid(X@theta) - y
    regterm = lda/len(y) * theta
    # theta_0 does not get regularized, so a 0 is substituted in its place
    regterm[0] = 0
    gradient = (1/len(y) * XT@beta).T + regterm
    return gradient
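Since I want to rule out a bug in the gradient, one check that could be run is a comparison of the analytic gradient against a finite-difference estimate using scipy.optimize.check_grad. The following is only a minimal sketch: it repeats the two functions above so that it runs on its own, and X_demo, y_demo, theta_demo and grad_error are hypothetical names with synthetic stand-in data, not the exercise data.

```python
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    h = sigmoid(X @ theta)
    return 1 / len(y) * (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) + reg

def compute_gradient_regularized(theta, X, y, lda):
    regterm = lda / len(y) * theta
    regterm[0] = 0  # theta_0 is not regularized
    return X.T @ (sigmoid(X @ theta) - y) / len(y) + regterm

# Synthetic stand-in data (an assumption, NOT the exercise data):
# 50 samples, a bias column plus 4 random features, random binary labels.
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 4))])
y_demo = (rng.random(50) < 0.5).astype(float)
theta_demo = rng.normal(scale=0.1, size=5)

# check_grad returns the 2-norm of the difference between the analytic
# gradient and a finite-difference approximation of the cost gradient.
grad_error = check_grad(compute_cost_regularized,
                        compute_gradient_regularized,
                        theta_demo, X_demo, y_demo, 0.1)
print("gradient check error:", grad_error)
```

A small error relative to the size of the gradient would suggest that the cost and gradient functions agree with each other, although it would not by itself show that the cost function is the right one.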
Function that implements one-vs-all classification training:
from scipy.optimize import minimize

def train_one_vs_all(X, y, opt_method):
    theta_all = np.zeros((y.max()-y.min()+1, X.shape[1]))
    for k in range(y.min(), y.max()+1):
        grdtruth = np.where(y==k, 1, 0)
        results = minimize(compute_cost_regularized, theta_all[k-1,:],
                           args = (X, grdtruth, 0.1),
                           method = opt_method,
                           jac = compute_gradient_regularized)
        # optimized parameters are accessible through the x attribute
        theta_optimized = results.x
        # Assign the theta_optimized vector to the appropriate row in the
        # theta_all matrix
        theta_all[k-1,:] = theta_optimized
    return theta_all
I called the function to train the model with each optimization method:
theta_all_optimized_cg = train_one_vs_all(X_bias, y, 'CG') # Optimization performed using Conjugate Gradient
theta_all_optimized_bfgs = train_one_vs_all(X_bias, y, 'BFGS') # optimization performed using Broyden–Fletcher–Goldfarb–Shanno
We see that prediction results differ based on the algorithm used:
def predict_one_vs_all(X, theta):
    # Returns the accuracy in percent; relies on the global label vector y
    return np.mean(np.argmax(sigmoid(X@theta.T), axis=1)+1 == y)*100
In[16]: predict_one_vs_all(X_bias, theta_all_optimized_cg)
Out[16]: 90.319999999999993
In[17]: predict_one_vs_all(X_bias, theta_all_optimized_bfgs)
Out[17]: 96.480000000000004
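To help separate "the optimizer stopped at a different point" from "there is a bug", it may be useful to look at what scipy.optimize.minimize reports for each method, not just the final accuracy. The sketch below is an illustration under assumptions: it uses synthetic data (X_demo, y_demo and summary are hypothetical names, not part of my notebook) and repeats the cost and gradient functions so that it is self-contained. It prints the final cost, the success flag, the iteration count and the termination message for CG and BFGS.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    h = sigmoid(X @ theta)
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / len(y) + reg

def compute_gradient_regularized(theta, X, y, lda):
    regterm = lda / len(y) * theta
    regterm[0] = 0  # theta_0 is not regularized
    return X.T @ (sigmoid(X @ theta) - y) / len(y) + regterm

# Synthetic stand-in data (an assumption, NOT the exercise data):
# noisy binary labels drawn from a random logistic model.
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 5))])
true_theta = rng.normal(size=6)
y_demo = (rng.random(200) < sigmoid(X_demo @ true_theta)).astype(float)

summary = {}
for method in ("CG", "BFGS"):
    res = minimize(compute_cost_regularized, np.zeros(6),
                   args=(X_demo, y_demo, 0.1),
                   method=method,
                   jac=compute_gradient_regularized)
    # OptimizeResult exposes the final cost and the termination status
    summary[method] = {"cost": res.fun, "success": res.success,
                       "iterations": res.nit, "message": res.message}
    print(method, summary[method])
```

Applied inside train_one_vs_all, the same fields could be collected per class. If one method reports success as False or ends with a noticeably higher cost for some classes, that would point towards early termination rather than an error in the cost or gradient code.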
Anyone who wants the data to try the code can find it in my GitHub repository, as linked in this post.