I am having trouble understanding the weight update rule for perceptrons:
w(t + 1) = w(t) + y(t)x(t).
Assume we have a linearly separable data set.
- w is a set of weights [w0, w1, w2, ...] where w0 is a bias.
- x is a set of input parameters [x0, x1, x2, ...] where x0 is fixed at 1 to accommodate the bias.
At iteration t, where t = 0, 1, 2, ...,
- w(t) is the set of weights at iteration t.
- x(t) is a misclassified training example.
- y(t) is the target output of x(t) (either -1 or 1).
Why does this update rule move the boundary in the right direction?
The perceptron's output is the hard limit of the dot product between the instance and the weight. Let's see how this changes after the update. Since
w(t + 1) = w(t) + y(t)x(t),
then
x(t) ⋅ w(t + 1) = x(t) ⋅ w(t) + x(t) ⋅ (y(t) x(t)) = x(t) ⋅ w(t) + y(t) [x(t) ⋅ x(t))].
Note that:
- By the algorithm's specification, the update is only applied if x(t) was misclassified.
- By the definition of the dot product, x(t) ⋅ x(t) ≥ 0.
How does this move the boundary relative to x(t)?
- If x(t) was correctly classified, then the algorithm does not apply the update rule, so nothing changes.
- If x(t) was incorrectly classified as negative, then y(t) = 1. It follows that the new dot product increased by x(t) ⋅ x(t) (which is positive). The boundary moved in the right direction as far as x(t) is concerned, therefore.
- Conversely, if x(t) was incorrectly classified as positive, then y(t) = -1. It follows that the new dot product decreased by x(t) ⋅ x(t) (which is positive). The boundary moved in the right direction as far as x(t) is concerned, therefore.
A better derivation of the perceptron update rule is documented here and here. The derivation is using gradient descent.
- Basic premise of gradient descent algorithm is find the error of classification and make your parameters so that error is minimized.
PS:
I was trying very hard to get the intuition on why would someone multiply x and y to derive the update for w. Because w is the slope for a single dimension (y = wx+c) and slope w = (y/x) and not y * x.