I'm trying to write some very simple LMS batch gradient descent, but I believe I'm doing something wrong with the gradient. The ratio between the gradient's order of magnitude and the initial value is very different for the different elements of theta, so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]])
# Prepend a column of ones for the intercept term.
x = np.concatenate((np.ones((5, 1), dtype=int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01

for i in range(1, 1000):
    h = np.dot(x, theta)
    # Batch gradient: sum (h - y) * x over the samples, one entry per parameter.
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    print(((h - y) ** 2).sum(), theta.squeeze().tolist())
The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
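To see why, compare the scale of the columns of x. A quick check, just as an illustration, reusing the x built in your snippet:

import numpy as np

x = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]])
x = np.concatenate((np.ones((5, 1), dtype=int), x), axis=1)

# Column sums: the intercept column sums to 5, the second column to 10520,
# the third to 15, so the middle feature dwarfs the others in (h - y) * x.
print(np.abs(x).sum(axis=0))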
You can perform the scaling in various ways; for now, let's just scale the features by their L^1 norms, because it's simple: divide each column of x by its sum, so that every column of x sums to 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly. With this we can proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
Let's see what we get now that we've found something that seems to converge. Again, the fitted parameters have to be scaled back to relate to the original, unscaled features.
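For instance, something along these lines, where alpha = 1. and the iteration count are only illustrative guesses, so expect to tune them:

import numpy as np

y = np.array([[400.], [330.], [369.], [232.], [540.]])
x = np.array([[2104., 3.], [1600., 3.], [2400., 3.], [1416., 2.], [3000., 4.]])
x = np.concatenate((np.ones((5, 1)), x), axis=1)
theta = np.array([[0.], [.1], [50.]])

# Scale columns to unit L^1 norm, and scale the initial guess to match (as above).
col_norms = np.abs(x).sum(axis=0, keepdims=True)
scaled_x = x / col_norms
scaled_theta = theta * col_norms.transpose()

alpha = 1.   # just a starting guess; adjust as needed
for i in range(10000):
    h = np.dot(scaled_x, scaled_theta)
    gradient = np.sum((h - y) * scaled_x, axis=0, keepdims=True).transpose()
    scaled_theta -= alpha * gradient

# Map the fitted parameters back to the original, unscaled features.
theta = scaled_theta / col_norms.transpose()
print(((np.dot(x, theta) - y) ** 2).sum(), theta.squeeze().tolist())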
At this point, let's cheat and check our results.
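One way to do that is to solve the least-squares problem directly with np.linalg.lstsq and compare it with what the descent converges to (treat this as a sanity check, not part of the algorithm):

import numpy as np

y = np.array([[400.], [330.], [369.], [232.], [540.]])
x = np.array([[2104., 3.], [1600., 3.], [2400., 3.], [1416., 2.], [3000., 4.]])
x = np.concatenate((np.ones((5, 1)), x), axis=1)

# Closed-form least-squares fit; gradient descent should approach this solution.
theta_exact, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)
print(theta_exact.squeeze().tolist())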
A general introductory reference on feature scaling is this Stanford lecture.