This question has been completely rewritten after I confirmed my results (the Python notebook can be found here) with a piece of code written by someone else (it can be found here). Here is that code, instrumented by me to work with my data and to count epochs until convergence:
import numpy as np
from matplotlib import pyplot as plt
class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=0.1, epochs=1000000):
        # add one weight for the bias
        self.W = np.zeros(input_size+1)
        #self.W = np.random.randn(input_size+1)
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        z = self.W.T.dot(x)
        return 1 if z >= 0 else 0

    def fit(self, X, d):
        errors = []
        for epoch in range(self.epochs):
            if (epoch + 1) % 10000 == 0: print('Epoch', epoch + 1)
            total_error = 0
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)   # prepend a 1 for the bias term
                y = self.predict(x)
                e = d[i] - y
                total_error += np.abs(e)
                self.W = self.W + self.lr * e * x
                #print('W: ', self.W)
            errors += [total_error]
            if (total_error == 0):
                print('Done after', epoch, 'epochs')
                nPlot = 100
                plt.plot(list(range(len(errors)-nPlot, len(errors))), errors[-nPlot:])
                plt.show()
                break
if __name__ == '__main__':
    trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                            [306.23240054, 128.3794866 ,   1.        ],
                            [216.67811217, 148.58167262,   1.        ],
                            [223.64431813, 197.75745016,   1.        ],
                            [486.68209275,  96.09115377,   1.        ],
                            [400.71323154, 125.18183395,   1.        ],
                            [288.87299305, 204.52217766,   1.        ],
                            [245.1492875 ,  55.75847006,  -1.        ],
                            [ 14.95991122, 185.92681911,   1.        ],
                            [393.92908798, 193.40527965,   1.        ],
                            [494.15988362, 179.23456285,   1.        ],
                            [235.59039363, 175.50868526,   1.        ],
                            [423.72071607,   9.50166894,  -1.        ],
                            [ 76.52735621, 208.33663341,   1.        ],
                            [495.1492875 ,  -7.73818431,  -1.        ]])
    X = trainingSet[:, :2]
    d = trainingSet[:, -1]
    d = np.where(d == -1, 1, 0)   # map the original labels {1, -1} to {0, 1}

    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)
The training set consists of 15 points, with a large separation margin. The Perceptron algorithm finds a separator as shown below, but after as many as 122,346 epochs:
As the Wikipedia article explains, the number of updates the perceptron needs to converge is proportional to the square of the size (norm) of the input vectors and inversely proportional to the square of the separation margin. In my data, the size of the vectors is large, but so is the margin.
I seek to understand why so many epochs are required.
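For reference, here is a rough sketch of how I would estimate the two quantities in that bound for this training set. It is an assumption on my part that a (nearly) hard-margin linear SVM from sklearn is a reasonable proxy for the maximum-margin separator, and note that the bound counts weight updates (mistakes), not epochs:

import numpy as np
from sklearn.svm import SVC

X = trainingSet[:, :2]          # trainingSet as defined in the code above
y = trainingSet[:, -1]          # original labels in {1, -1}

R = np.max(np.linalg.norm(X, axis=1))              # largest input norm

# approximate the maximum-margin separator with a (nearly) hard-margin linear SVM
svm = SVC(kernel='linear', C=1e9).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]
gamma = np.min(np.abs(X.dot(w) + b)) / np.linalg.norm(w)   # geometric margin

print('R =', R, 'gamma =', gamma, 'mistake bound (R/gamma)^2 =', (R / gamma) ** 2)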
Update: As per the request in the comments, I updated the code to plot the total errors of the last 100 epochs. Here is the plot:
P.S.: After scaling the features to be distributed as N(0,1), the algorithm converges after only two epochs. However, I do not grasp why the algorithm would not converge in a reasonable amount of time without such scaling.
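For completeness, the scaling I refer to in the P.S. is ordinary standardization; a minimal sketch of one way to do it with sklearn (StandardScaler is just the tool I would reach for here, not necessarily what the notebook uses):

from sklearn.preprocessing import StandardScaler

# rescale each feature to zero mean and unit variance before fitting the perceptron
X = StandardScaler().fit_transform(trainingSet[:, :2])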
The problem you're facing can be summarized in a simple statement: the numbers in your example do not favor convergence of your perceptron.
Honestly, I'm not sure exactly what can be learned from your synthetic example; in any case, please don't take me wrong: it is always good to play around in the laboratory and learn from it. There are a number of generic recommendations for fitting neural nets, and some of them are reflected in the comments to your question. This paper is old but good, and you'll see it referenced around.
About your problem in particular: it is not really a matter of standardizing, but of centering. The problem is that when you re-evaluate your weights
self.W = self.W + self.lr * e * x
your error term e will be either +1 or -1, depending on the example you mis-classify (e.g. -1 if the example's target is 0 and it is classified as 1), but mostly -1, since there are many more examples of that class, and your coordinates in x are mostly positive values. So, most of the time, you will be subtracting from your weights rather than adding to them, and this way it is obviously quite slow for the perceptron to find a solution.
If you just center your X with
X = scale(X, with_mean=True, with_std=False)
(scale comes from sklearn.preprocessing), convergence takes only 1,461 epochs.
The classifier looks like this
and it makes sense that the boundary is very close to the positive classes, since there are many of them; as soon as the perceptron gets all the positive classes right, the job is nearly done.
Additionally, if you rebalance your data (I've done it in this lazy way, as a test, by duplicating negative examples):
trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                        [306.23240054, 128.3794866 ,   1.        ],
                        [216.67811217, 148.58167262,   1.        ],
                        [223.64431813, 197.75745016,   1.        ],
                        [486.68209275,  96.09115377,   1.        ],
                        [400.71323154, 125.18183395,   1.        ],
                        [288.87299305, 204.52217766,   1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [ 14.95991122, 185.92681911,   1.        ],
                        [393.92908798, 193.40527965,   1.        ],
                        [494.15988362, 179.23456285,   1.        ],
                        [235.59039363, 175.50868526,   1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [ 76.52735621, 208.33663341,   1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ]])
it takes 2 epochs (surprisingly) to get this classifier
Hope it helps.
EDIT after comments
(1) About errors mostly adding to or subtracting from the weights
Let's take an example from the positive class:
[279.25746446, 162.44072328, 1. ]
For these, since d is equal to 0, e can only be 0 if the classifier gets it right and -1 if it gets it wrong:
e = d[i] - self.predict(x)
(predict returns either 0 or 1.)
When updating the weights, the update adds nothing if the classifier gets the example right, and -1 * x * learning rate if it gets it wrong. For this example, assuming lr == 1, it will subtract exactly (1, 279.25746446, 162.44072328) whenever there is an error on this positive example.
Now, take a look at all the positive examples. If you don't transform X, all their coordinates have positive values, so every classification error on them will subtract from the weights.
Now let's take a negative example:
[245.1492875 , 55.75847006, -1. ]
For these, since d is equal to 1, e can only be 0 if the classifier gets it right and +1 if it gets it wrong. Again, all coordinates are positive, except for one coordinate in the 3rd negative example, so nearly every mistake on a negative example will add to the weights.
But there are only 3 examples of the negative class and 12 of the positive class, so the errors will mostly be subtracting from the weights rather than adding to them. It is reasonable, then, to think that convergence will be slow if you do nothing, and faster if you center the data. (One could even wonder how it converges at all.)
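If you want to verify this claim empirically, here is a small sketch (my own addition, replaying the same update rule as the fit method above) that simply tallies how many updates add to the weights and how many subtract from them:

import numpy as np

def tally_update_signs(X, d, lr=0.1, epochs=1000000):
    """Replays the perceptron updates and counts how many add vs. subtract."""
    W = np.zeros(X.shape[1] + 1)
    n_add, n_sub = 0, 0
    for epoch in range(epochs):
        total_error = 0
        for i in range(d.shape[0]):
            x = np.insert(X[i], 0, 1)
            y = 1 if W.dot(x) >= 0 else 0
            e = d[i] - y
            if e > 0:
                n_add += 1      # update adds lr * x to the weights
            elif e < 0:
                n_sub += 1      # update subtracts lr * x from the weights
            total_error += abs(e)
            W = W + lr * e * x
        if total_error == 0:
            break
    return n_add, n_sub

# with the raw (uncentered) X and the 0/1 targets d from the question
# (this replays the full 100k+ epochs, so it takes a little while):
# print(tally_update_signs(X, d))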
(2) About resampling
I meant to say that convergence with resampling (and centering) is surprisingly fast: 2 epochs. It is nevertheless reasonable that resampling makes convergence faster, since there is more balance between the errors pulling the solution in one direction and those pulling it in the other.
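As a side note, instead of copy-pasting rows by hand, the same lazy oversampling could be written with np.repeat; this is just a sketch, and the repetition factor 4 is arbitrary rather than exactly the counts I used above:

import numpy as np

# oversample the minority (negative) class by simple repetition
pos = trainingSet[trainingSet[:, -1] ==  1.]
neg = trainingSet[trainingSet[:, -1] == -1.]
balanced = np.vstack([pos, np.repeat(neg, 4, axis=0)])   # 12 positives vs 12 negatives
np.random.shuffle(balanced)                              # optional: mix the rows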
Hope it is more clear now.
EDIT after more comments
I understand that the importance of the balance between samples, and of how they pull on the solution, may not be very intuitive. Actually, the way I approached your question was probably the opposite: I looked at your loss function, thought about what the problem could be, drew on similar problems I had faced in the past and on intuitions I had, and thought about rebalancing; then I tried rebalancing and afterwards centering the data, and confirmed my intuitions about your loss function. Only afterwards did I try to build an explanation for you.
Of course, it is not that I process the loss function in my mind and know what it is doing. In any case, I would suggest that you build your own intuitions, since your goal is to learn, and you could do it this way: plot how the separating line moves epoch after epoch.
From your code:
from matplotlib import pyplot as plt

labels = [1, 0]
labelColors = ['blue', 'green']

def showData(X, y, plt = plt):
    colors = [(labelColors[0] if el == labels[0] else labelColors[1]) for el in y]
    plt.scatter(X[:,0], X[:,1], c=colors)

def plotW(xs, w):
    # boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1)/w2
    plt.plot(xs, (w[0] + w[1] * xs)/-w[2], color = 'red', linewidth=4)
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import scale
class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=0.1, epochs=1000000):
        # add one weight for the bias
        self.W = np.zeros(input_size+1)
        #self.W = np.random.randn(input_size+1)
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        z = self.W.T.dot(x)
        return 1 if z >= 0 else 0

    def fit(self, X, d):
        errors = []
        for epoch in range(self.epochs):
            if (epoch + 1) % 10000 == 0: print('Epoch', epoch + 1)
            total_error = 0
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)
                y = self.predict(x)
                e = d[i] - y
                total_error += np.abs(e)
                self.W = self.W + self.lr * e * x
                #print('W: ', self.W)
            errors += [total_error]
            # plot the data and the current separating line after every epoch
            showData(X, d)
            plotW(X[:,0], self.W)
            plt.show()
            if epoch == 100:
                break
            if (total_error == 0):
                print('Done after', epoch, 'epochs')
                nPlot = 100
                plt.plot(list(range(len(errors)-nPlot, len(errors))), errors[-nPlot:])
                plt.show()
                break
if __name__ == '__main__':
    trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                            [306.23240054, 128.3794866 ,   1.        ],
                            [216.67811217, 148.58167262,   1.        ],
                            [223.64431813, 197.75745016,   1.        ],
                            [486.68209275,  96.09115377,   1.        ],
                            [400.71323154, 125.18183395,   1.        ],
                            [288.87299305, 204.52217766,   1.        ],
                            [245.1492875 ,  55.75847006,  -1.        ],
                            [ 14.95991122, 185.92681911,   1.        ],
                            [393.92908798, 193.40527965,   1.        ],
                            [494.15988362, 179.23456285,   1.        ],
                            [235.59039363, 175.50868526,   1.        ],
                            [423.72071607,   9.50166894,  -1.        ],
                            [ 76.52735621, 208.33663341,   1.        ],
                            [495.1492875 ,  -7.73818431,  -1.        ]])
    X = trainingSet[:, :2]
    X = scale(X, with_mean=True, with_std=False)
    d = trainingSet[:, -1]
    d = np.where(d == -1, 1, 0)

    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)
Then compare the evolution of the line in the different setups. If you compare the first 100 epochs with and without centering, you will see that when you do not center the data the line tends to bounce around in a sort of loop, while with centering the line moves more smoothly. (That's actually the same kind of effect you usually get when lowering the learning rate, as some people suggested in the comments.)
I don't mean to say that looking at those plots is analytical evidence for the behavior of your loss function, and I don't pretend that this is a real answer to your question. But if it helps you build an intuition, then it will have been worth it.
There's a lot of work on convergence, and it has been applied extensively in deep learning since it is a key issue there, as you probably know. You have surely heard about the different optimizers and how they affect the convergence of a loss function that, in deep learning or in complex neural nets in general, is certainly difficult to understand and impossible to tackle analytically.
When I could not answer your question properly one month ago, I kind of regretted it; now I am giving it another try. I leave the older answer for the record.
I think the problem relates to the (lack of) convexity and the local minima of the loss function, which make it difficult to converge. However, with your problem as you set it up, I'm not really sure about the derivative of your loss function, so I've modified your activation function to a sigmoid so that I can easily apply the log loss.
This is the new predict:
def predict(self, x):
    z = self.W.T.dot(x)
    return 1/(1+np.exp(-z))
And this is the loop over the training data, which also computes the loss:
loss = 0
dw = 0
for i in range(d.shape[0]):
    x = np.insert(X[i], 0, 1)
    y = self.predict(x)
    e = d[i] - (1 if y > 0.5 else 0)
    total_error += np.abs(e)
    dw += self.lr * e * x
    loss2add = (-1) * (np.log(y) if d[i] else np.log(1-y))
    if np.isinf(loss2add) or np.isnan(loss2add):
        loss += 500
    else:
        loss += loss2add
self.W = self.W + dw      # note: the weights are now updated once per epoch
errors += [total_error]
losses += [loss/d.shape[0]]
It converges in 103K epochs, so I hope you believe this behaves similarly to your initial setup.
Then I plot the cost function as a function of W. To make it simple, I take 2 values from a known solution and only vary the remaining one. This is the code (it could be cleaner, I know):
def predict(W, x):
    z = W.dot(x)
    return 1/(1+np.exp(-z))

trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                        [306.23240054, 128.3794866 ,   1.        ],
                        [216.67811217, 148.58167262,   1.        ],
                        [223.64431813, 197.75745016,   1.        ],
                        [486.68209275,  96.09115377,   1.        ],
                        [400.71323154, 125.18183395,   1.        ],
                        [288.87299305, 204.52217766,   1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [ 14.95991122, 185.92681911,   1.        ],
                        [393.92908798, 193.40527965,   1.        ],
                        [494.15988362, 179.23456285,   1.        ],
                        [235.59039363, 175.50868526,   1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [ 76.52735621, 208.33663341,   1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ]])
X = trainingSet[:, :2]
d = trainingSet[:, -1]
d = np.where(d == -1, 1, 0)

losses = []
ws = []
n_points = 10001
for w1 in np.linspace(-40, 40, n_points):
    ws += [w1]
    W = np.array([3629., w1, -238.21109877])
    loss = 0
    for i in range(d.shape[0]):
        x = np.insert(X[i], 0, 1)
        y = predict(W, x)
        loss2add = (-1) * (np.log(y) if d[i] else np.log(1-y))
        if np.isinf(loss2add) or np.isnan(loss2add):
            loss += 500
        else:
            loss += loss2add
    losses += [loss]

plt.plot(ws, losses)
plt.show()
The solution for w1 is 39.48202635. Take a look at the loss:
which has some peaks, and thus some local minima, in which the optimization can easily get stuck.
However if you center the data with
X = scale(X, with_mean=True, with_std=False)
and set the w's to
W = np.array([-550.3, w1, -59.65467824])
you get the following loss function
which has its minimum in the expected area (the solution for w1 is -11.00208344).
I'd expect a smoother function for the balanced dataset.
Hope it is clearer now!
EDIT after comments
This is the loss function when standardizing; it converges in 26 epochs.
(Not centering in this case!)
The solution is about 0.7, and the loss is even smoother. It makes sense that standardizing works so well with logistic regression, since it does not saturate the output of the activation function.
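To make the saturation point concrete, here is a tiny sketch with made-up weights and illustrative input magnitudes (not taken from the actual fit): with raw features in the hundreds, the sigmoid output is pinned at 0 or 1 and its gradient factor y*(1-y) vanishes, while inputs on a standardized scale keep it responsive.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([1.0, -0.5])              # made-up weights, used for both cases

x_raw = np.array([279.3, 162.4])       # raw feature magnitudes, as in the question
x_std = np.array([-0.3, 0.6])          # illustrative values on a standardized scale

for name, x in [('raw', x_raw), ('standardized', x_std)]:
    y = sigmoid(w.dot(x))
    print(name, '-> output:', y, ', gradient factor y*(1-y):', y * (1 - y))
# raw:          output ~= 1.0,  gradient factor ~= 0    -> saturated, almost no learning signal
# standardized: output ~= 0.35, gradient factor ~= 0.23 -> a useful gradient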
For the rest, I don't have anything to add on how to reconcile these results with the theory you mention. I guess the theorem gives an upper bound, but beyond that, I have no idea. Cheers.