Why use softmax as opposed to standard normalization?

Posted 2019-01-16 00:12

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \qquad j = 1, \dots, K$$

This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?
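
To make the comparison concrete, here is a minimal sketch of the two options, assuming a simple min-shift as the "Z transform" step that makes all outputs positive (the function names and input values are illustrative):

    import numpy as np

    def softmax(z):
        # Exponentiate, then divide by the sum: always a valid distribution.
        e = np.exp(z)
        return e / e.sum()

    def shift_and_normalize(z):
        # The proposed alternative: shift so all outputs are
        # non-negative, then divide by the sum.
        shifted = z - z.min()
        return shifted / shifted.sum()

    z = np.array([1.0, 2.0, 3.0])
    print(softmax(z))              # ~[0.090, 0.245, 0.665]
    print(shift_and_normalize(z))  # ~[0.000, 0.333, 0.667]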

9 Answers
时光不老,我们不散
#2 · 2019-01-16 01:03

Adding to Piotr Czapla's answer: the larger the input values are, the greater the probability softmax assigns to the maximum input, even when the inputs keep the same proportions relative to one another.

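For example, here is a minimal NumPy sketch (the input values are made up) showing that the same 1:2 proportion gives very different softmax outputs as the magnitudes grow, whereas plain linear normalization would give [1/3, 2/3] in both cases:

    import numpy as np

    def softmax(z):
        e = np.exp(z)
        return e / e.sum()

    # Same 1:2 proportion between the inputs, different magnitudes.
    print(softmax(np.array([1.0, 2.0])))    # ~[0.269, 0.731]
    print(softmax(np.array([10.0, 20.0])))  # ~[4.5e-05, 0.99995]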

神经病院院长
#3 · 2019-01-16 01:09

I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.

On the surface, softmax looks like a simple non-linear normalization (we spread the data with an exponential), but there is more to it than that.

Specifically, there are a couple of different views (same link as above):

  1. Information Theory - from the perspective of information theory, the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.

  2. Probabilistic View - from this perspective, we are in fact looking at log-probabilities; when we exponentiate them, we end up with the raw probabilities. In this case, the softmax equation finds the MLE (Maximum Likelihood Estimate).

In summary, even though the softmax equation might seem arbitrary, it is NOT. It is actually a rather principled way of normalizing the classifications so as to minimize the cross-entropy/negative log-likelihood between predictions and the truth.
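
A minimal NumPy sketch of both views (the logits and the true class index are made-up values):

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])  # unnormalized log-probabilities
    true_class = 0

    # Probabilistic view: exponentiating the log-probabilities and
    # normalizing recovers a proper probability distribution.
    probs = np.exp(logits) / np.exp(logits).sum()

    # Information-theory view: the cross-entropy between the one-hot
    # truth and the prediction is the negative log-probability
    # assigned to the true class.
    cross_entropy = -np.log(probs[true_class])
    print(probs)          # ~[0.659, 0.242, 0.099]
    print(cross_entropy)  # ~0.417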

甜甜的少女心
#4 · 2019-01-16 01:09

We are looking at a multi-class classification problem: the predicted variable y can take one of k values, where k > 2. In probability terms, y follows a multinomial distribution, and the multinomial distribution belongs to a large family of distributions called the exponential family. Using the properties of exponential-family distributions, we can reconstruct the probability P(y = i | x), and it coincides with the softmax formula.

For further information and a formal derivation, see the CS229 lecture notes (Softmax Regression).
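
Concretely, those notes show that writing the multinomial as an exponential-family distribution with natural parameters $\eta_i = \theta_i^T x$ yields the response function

$$P(y = i \mid x; \theta) = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}},$$

which is exactly the softmax applied to the linear scores $\theta_i^T x$.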

A useful trick is often applied to softmax: softmax(x) = softmax(x + c) for any constant c; that is, softmax is invariant to constant offsets in its input:

$$\operatorname{softmax}(x + c)_j = \frac{e^{x_j + c}}{\sum_k e^{x_k + c}} = \frac{e^c \, e^{x_j}}{e^c \sum_k e^{x_k}} = \operatorname{softmax}(x)_j$$
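
This property is what makes a numerically stable softmax implementation possible: subtracting max(x) before exponentiating prevents overflow without changing the result. A minimal NumPy sketch:

    import numpy as np

    def stable_softmax(z):
        # Shift by the maximum before exponentiating; valid because
        # softmax(z) == softmax(z + c) for any constant c.
        e = np.exp(z - z.max())
        return e / e.sum()

    z = np.array([1000.0, 1001.0, 1002.0])
    print(np.exp(z))          # overflows: [inf inf inf]
    print(stable_softmax(z))  # ~[0.090, 0.245, 0.665]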
