For a neural network library I implemented some activation functions and loss functions together with their derivatives. They can be combined arbitrarily, and the derivative at the output layer is just the product of the loss derivative and the activation derivative.
However, I failed to implement the derivative of the Softmax activation function independently of any loss function. Because of the normalization, i.e. the denominator in the equation, changing a single input activation changes all output activations, not just one.
Here is my Softmax implementation, whose derivative fails gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?
    import numpy as np

    class Softmax:

        def compute(self, incoming):
            exps = np.exp(incoming)
            return exps / exps.sum()

        def delta(self, incoming, outgoing):
            # Equals softmax(incoming) * (1 - softmax(incoming)), i.e. only the
            # diagonal of the softmax Jacobian.
            exps = np.exp(incoming)
            others = exps.sum() - exps
            return 1 / (2 + exps / others + others / exps)

    # SquaredError and incoming are defined elsewhere in the library.
    activation = Softmax()
    cost = SquaredError()
    outgoing = activation.compute(incoming)
    delta_output_layer = activation.delta(incoming, outgoing) * cost.delta(outgoing)
It should be like this: (x is the input to the softmax layer and dy is the delta coming from the loss above it)
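A minimal sketch of that delta, assuming the 1-D inputs used in the question and keeping the question's class layout (only the body of delta changes; the simplification follows from the Jacobian argument in the explanation below):

    import numpy as np

    class Softmax:

        def compute(self, x):
            exps = np.exp(x)
            return exps / exps.sum()

        def delta(self, x, dy):
            y = self.compute(x)
            # Multiply dy by the softmax Jacobian J[i, j] = y[i] * ((i == j) - y[j]);
            # the product simplifies to y * (dy - dot(dy, y)).
            return y * (dy - np.dot(dy, y))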
But the way you compute the error should be:
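Again only a sketch, reusing the names from the question; the point is that the loss gradient is passed into the activation's delta rather than multiplied with it elementwise:

    outgoing = activation.compute(incoming)
    delta_output_layer = activation.delta(incoming, cost.delta(outgoing))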
Explanation: Because the delta function is part of the backpropagation algorithm, its responsibility is to multiply the vector dy (in my code; outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1] and then multiply it from the left by the vector dy, after a bit of algebra you will find that you get exactly the expression used in the Python sketch above.

[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network
Mathematically, the derivative of Softmax σ(j) with respect to the logit Zi (for example, Wi*X) is

    ∂σ(j)/∂Zi = σ(j) * (δij - σ(i))

where δij is the Kronecker delta (1 if i = j and 0 otherwise).
If you implement it iteratively:
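One possible iterative sketch of that formula in NumPy (the name softmax_grad and the convention that s is the already-computed softmax output are assumptions of this sketch):

    import numpy as np

    def softmax_grad(s):
        # s: 1-D array holding the softmax output for one sample.
        # Returns the full Jacobian with entries s[i] * (kronecker(i, j) - s[j]).
        n = len(s)
        jacobian = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    jacobian[i, j] = s[i] * (1 - s[i])
                else:
                    jacobian[i, j] = -s[i] * s[j]
        return jacobian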
Test:
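As a sketch, a gradient check in the spirit of the question can compare the softmax_grad Jacobian above against central finite differences (the example vector and tolerance are arbitrary choices):

    import numpy as np

    def softmax(v):
        exps = np.exp(v)
        return exps / exps.sum()

    x = np.array([1.0, 2.0, 0.5])
    s = softmax(x)

    # Finite-difference approximation of the Jacobian, column by column.
    eps = 1e-6
    numeric = np.empty((len(x), len(x)))
    for j in range(len(x)):
        step = np.zeros_like(x)
        step[j] = eps
        numeric[:, j] = (softmax(x + step) - softmax(x - step)) / (2 * eps)

    print(np.allclose(softmax_grad(s), numeric, atol=1e-6))  # expected: True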
If you implement it in a vectorized version:
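A vectorized sketch of the same Jacobian (again for a 1-D softmax output s; the function name is illustrative):

    import numpy as np

    def softmax_grad_vectorized(s):
        s = np.asarray(s).reshape(-1)
        # diag(s) - outer(s, s) reproduces s[i] * (kronecker(i, j) - s[j]) in one step.
        return np.diagflat(s) - np.outer(s, s)

It should agree with the iterative version above, e.g. np.allclose(softmax_grad(s), softmax_grad_vectorized(s)) returns True.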
Here is a C++ vectorized version using intrinsics (22 times (!) faster than the non-SSE version):
If for some reason somebody wants a simple (non-SSE) version, here it is: