Using the notation from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:
import numpy as np

def sigmoid(z):
    # standard logistic sigmoid (used in the forward pass below)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # because σ'(x) = σ(x) (1 - σ(x)); note: z here is already an activation σ(x)
    return z * (1 - z)

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward pass
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1) <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
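(For context, here is a minimal, hypothetical wrapper showing how self.weights and self.learning_rate could be set up for MNIST-sized inputs; the class name NeuralNetwork, the layer sizes and the initialization scale are my own assumptions, not part of the original code.)

import numpy as np

class NeuralNetwork:
    # hypothetical setup: 784 -> 200 -> 200 -> 10, no biases, to match the code above
    def __init__(self, sizes=(784, 200, 200, 10), learning_rate=0.1):
        rng = np.random.default_rng(0)
        # self.weights[k] maps layer k to layer k+1, hence has shape (sizes[k+1], sizes[k])
        self.weights = [rng.standard_normal((m, n)) / np.sqrt(n)
                        for n, m in zip(sizes[:-1], sizes[1:])]
        self.learning_rate = learning_rate

    # train(...) as defined above would be a method of this class

# usage on one (flattened image, one-hot label) pair:
# net = NeuralNetwork()
# net.train(image_pixels, one_hot_label)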
It works, but:
the final accuracy (for my use case, MNIST digit recognition) is just OK, not very good. It is much better (i.e. the convergence is much better) when line (1) is replaced by:
delta = np.dot(self.weights[k].T, delta) # (2)
the code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also suggests:
delta = np.dot(self.weights[k].T, delta)
instead of:
delta = np.dot(self.weights[k].T, tmp)
(With the notation of that article, the line is:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
)
These two observations seem to agree: code (2) is better than code (1).
However, the math seems to show the contrary (see the video mentioned above; another detail: note that my loss function is multiplied by 1/2, whereas it is not in the video).
Question: which one is correct, implementation (1) or (2)?
In LaTeX, the chain rule from the video gives:
$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y) $$
$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$
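For reference, the same chain rule can be written in vectorized form (this rewrite is mine, not from the video; $\odot$ denotes the elementwise product, and the 1/2 factor in the cost is assumed). In the code above, tmp corresponds to $\delta^{l}$ and the propagated delta corresponds to $\frac{\partial C}{\partial a^{l-1}} = (w^{l})^T \delta^{l}$:

$$\delta^L = (a^L - y)\odot\sigma'(z^L), \qquad \delta^{l-1} = \big((w^{l})^T \delta^{l}\big)\odot\sigma'(z^{l-1}), \qquad \frac{\partial{C}}{\partial{w^{l}}} = \delta^{l}\,(a^{l-1})^T$$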
I spent two days analyzing this problem, filled a few notebook pages with partial-derivative computations... and I can confirm:
code (1) is the correct one, and it agrees with the math computations;
code (2) is wrong;
and there is a slight mistake in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set: its output_errors update (quoted above) should apply the sigmoid derivative to output_errors before multiplying by the transposed weight matrix, i.e. it should propagate the analogue of tmp (code (1)), not the raw error (code (2)).
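One way to double-check this numerically (a sketch I am adding here for verification; the tiny layer sizes, the helper names forward, cost and backprop_grads, and the random seed are arbitrary choices of mine) is to compare the gradients produced by the code-(1) recursion against finite differences of the cost $C = \frac{1}{2}\|a^L - y\|^2$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, a):
    # returns the list of activations, input included (same role as A in the question)
    A = [a]
    for W in weights:
        a = sigmoid(np.dot(W, a))
        A.append(a)
    return A

def cost(weights, a, y):
    return 0.5 * np.sum((forward(weights, a)[-1] - y) ** 2)

def backprop_grads(weights, a, y):
    # variant (1): apply sigma' (via tmp) before propagating through W.T
    A = forward(weights, a)
    grads = [None] * len(weights)
    delta = A[-1] - y
    for k in reversed(range(len(weights))):
        tmp = delta * A[k + 1] * (1 - A[k + 1])
        grads[k] = np.dot(tmp, A[k].T)
        delta = np.dot(weights[k].T, tmp)
    return grads

rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((3, 3)), rng.standard_normal((2, 3))]
x, y = rng.standard_normal((2, 1)), rng.standard_normal((2, 1))

grads = backprop_grads(weights, x, y)
eps = 1e-6
for k, W in enumerate(weights):
    num = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W[i, j] += eps
            c_plus = cost(weights, x, y)
            W[i, j] -= 2 * eps
            c_minus = cost(weights, x, y)
            W[i, j] += eps                      # restore the original weight
            num[i, j] = (c_plus - c_minus) / (2 * eps)
    print("layer", k, "max |analytic - numeric| =", np.max(np.abs(num - grads[k])))

With variant (1) the per-layer differences should be down at finite-difference/floating-point noise level; swapping in the code-(2) propagation leaves the last layer's gradient unchanged but makes the hidden layers' gradients disagree with the numerical ones.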
Now the difficult part, which took me days to realize: apparently code (2) converges far better than code (1), and that is what misled me into thinking code (2) was correct and code (1) was wrong... But in fact that is just a coincidence, because the learning_rate was set too low. Here is the reason: with code (2), the delta vector grows much faster than with code (1) (printing np.linalg.norm(delta) helps to see this). So the "incorrect code (2)" simply compensated for the "too slow learning rate" with a bigger delta, and that led, in some cases, to an apparently faster convergence.
Now solved!
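To see this effect in isolation, here is a small sketch (my own addition; the random weights, the 784 -> 30 -> 30 -> 10 sizes and the fake input are arbitrary) that runs one backward pass with both propagation rules and prints np.linalg.norm(delta) after each layer:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
weights = [0.1 * rng.standard_normal((30, 784)),
           0.1 * rng.standard_normal((30, 30)),
           0.1 * rng.standard_normal((10, 30))]
x = rng.random((784, 1))        # stand-in for one flattened MNIST image
y = np.zeros((10, 1))
y[3] = 1.0                      # stand-in one-hot target

# forward pass, exactly as in the question (no biases)
A = [x]
a = x
for W in weights:
    a = sigmoid(np.dot(W, a))
    A.append(a)

for name, use_tmp in [("code (1): W.T @ tmp", True), ("code (2): W.T @ delta", False)]:
    print(name)
    delta = A[-1] - y
    for k in [2, 1, 0]:
        tmp = delta * A[k + 1] * (1 - A[k + 1])   # error with sigma' applied
        delta = np.dot(weights[k].T, tmp if use_tmp else delta)
        print("  after layer", k, ": ||delta|| =", np.linalg.norm(delta))

Since σ'(a) = a(1 - a) ≤ 1/4 for the sigmoid, code (1) shrinks the propagated error at every layer, whereas code (2) skips that factor, so its delta is typically much larger in norm; combined with a small learning_rate, that can masquerade as faster convergence.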