In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:

$$P(y = j \mid \mathbf{z}) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$
This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?
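For concreteness, the kind of normalization I have in mind would be something like the following rough sketch (shifting by the minimum so all outputs are non-negative, then dividing by the sum; the exact shift is not important):

```python
import numpy as np

def naive_normalise(z):
    # Shift so every output is non-negative, then divide by the sum.
    shifted = z - z.min()
    return shifted / shifted.sum()

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
print(naive_normalise(z))  # [0.    0.333 0.667] -- the smallest output always gets probability 0
print(softmax(z))          # [0.090 0.245 0.665]
```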
Adding to Piotr Czapla's answer: for inputs in the same proportion, the larger the input values, the greater the probability assigned to the maximum input compared to the others:
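For instance, with a quick NumPy sketch (the exact numbers are just illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

# Inputs in the same 1:2 proportion, at different scales:
print(softmax(np.array([1.0, 2.0])))    # ~[0.269, 0.731]
print(softmax(np.array([10.0, 20.0])))  # ~[4.5e-05, 0.99995]
```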
I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.
On the surface the softmax algorithm seems to be a simple non-linear normalization (we spread the data with an exponential). However, there is more to it than that.
Specifically, there are a couple of different views (same link as above):
Information Theory - from the perspective of information theory the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.
Probabilistic View - from this perspective we are in fact looking at log-probabilities, so when we exponentiate them we end up with the raw probabilities. In this case the softmax equation finds the MLE (Maximum Likelihood Estimate).
In summary, even though the softmax equation seems like it could be arbitrary, it is NOT. It is actually a rather principled way of normalizing the classification scores so as to minimize the cross-entropy/negative log-likelihood between the predictions and the truth.
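To make the cross-entropy/negative log-likelihood view concrete, here is a minimal NumPy sketch (my own example, not code from the CS231n notes):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, true_class):
    # Negative log of the probability that softmax assigns to the true class.
    return -np.log(softmax(logits)[true_class])

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy(logits, 0))  # ~0.417: the true class already has the largest score
print(cross_entropy(logits, 2))  # ~2.317: the true class is given little probability
```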
We are looking at a multi-class classification problem. The predicted variable $y$ can take one of $k$ values, where $k > 2$. In probability terms, this follows a multinomial distribution, and the multinomial distribution belongs to a large family called the exponential family. Using the properties of exponential-family distributions, we can reconstruct the probability $P(y = i \mid x)$, and it coincides with the softmax formula. For further information and a formal proof, see the CS229 lecture notes (Softmax Regression).
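A rough sketch of that view in code, with a hypothetical weight matrix W just to show the shape of the computation (this is softmax regression in the spirit of the CS229 notes, not a formal derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 3                  # k classes, d input features (hypothetical sizes)
W = rng.normal(size=(k, d))  # one weight vector per class
x = rng.normal(size=d)

logits = W @ x                                 # one score per class
probs = np.exp(logits) / np.exp(logits).sum()  # P(y = i | x) for each class i
print(probs, probs.sum())                      # a valid distribution: non-negative, sums to 1
```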
A useful trick commonly applied to softmax: softmax(x) = softmax(x + c) for any constant c; that is, softmax is invariant to constant offsets in the input.
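In practice this invariance is what makes softmax safe to compute numerically: choosing c = -max(x) keeps the exponents from overflowing without changing the result. A minimal sketch:

```python
import numpy as np

def softmax_stable(x):
    # softmax(x) == softmax(x + c); choosing c = -max(x) keeps the largest
    # exponent at e^0 = 1, so np.exp never overflows.
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
# np.exp(1000.0) overflows to inf, but the shifted version is fine:
print(softmax_stable(x))   # ~[0.090, 0.245, 0.665]
```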