In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:

p_i = exp(q_i) / sum_j exp(q_j)
This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?
The values of q_i represent log-likelihoods. In order to recover the probability values, you need to exponentiate them.
One reason that statistical algorithms often use log-likelihood loss functions is that they are more numerically stable: a product of probabilities may be represented by a very small floating-point number and can underflow. Using a log-likelihood loss function, a product of probabilities becomes a sum.
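As a quick illustration of the underflow point (my own sketch, not from the original answer):

    import math

    # 1000 samples, each assigned probability 0.01 by the model
    probs = [0.01] * 1000

    # The product of the probabilities underflows to 0.0 in double precision
    product = 1.0
    for p in probs:
        product *= p
    print(product)                 # 0.0 (the true value, 1e-2000, is far below float range)

    # The sum of the log-probabilities stays perfectly representable
    log_likelihood = sum(math.log(p) for p in probs)
    print(log_likelihood)          # about -4605.17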
Another reason is that log-likelihoods occur naturally when deriving estimators for random variables that are assumed to be drawn from multivariate Gaussian distributions. See for example the Maximum Likelihood (ML) estimator and the way it is connected to least squares.
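To sketch that connection (my own summary, assuming i.i.d. Gaussian noise with fixed variance sigma^2 around a model f(x; theta)):

    % Observations: y_k = f(x_k; \theta) + \epsilon_k, with \epsilon_k \sim \mathcal{N}(0, \sigma^2)
    \log L(\theta)
      = \sum_{k=1}^{m} \log \mathcal{N}\!\left(y_k \mid f(x_k;\theta), \sigma^2\right)
      = -\frac{1}{2\sigma^2} \sum_{k=1}^{m} \left(y_k - f(x_k;\theta)\right)^2
        - \frac{m}{2} \log\left(2\pi\sigma^2\right)

so maximising the log-likelihood over theta is the same as minimising the sum of squared errors.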
As a sidenote, I think that this question is more appropriate for the CS Theory or Computational Science Stack Exchanges.
I think one of the reasons is to deal with negative numbers and division by zero, since exp(x) is always positive and greater than zero.
For example for
a = [-2, -1, 1, 2]
the sum will be 0, so dividing by the sum of the outputs would mean dividing by zero; softmax avoids this (see the sketch below).

From "An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family" (https://arxiv.org/abs/1511.05042): the authors explored some other functions, among them a Taylor expansion of exp and the so-called spherical softmax, and found that they can sometimes perform better than the usual softmax.
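A minimal sketch of that failure mode (my own illustration, using NumPy):

    import numpy as np

    a = np.array([-2.0, -1.0, 1.0, 2.0])

    # Naive normalisation: divide by the sum of the raw outputs.
    # Here the outputs sum to zero, so this divides by zero.
    print(a.sum())               # 0.0
    # print(a / a.sum())         # runtime warning, produces inf/nan

    # Softmax: exponentiate first, so every term is strictly positive
    # and the denominator can never be zero.
    exp_a = np.exp(a)
    print(exp_a / exp_a.sum())   # approx. [0.0128 0.0347 0.2562 0.6964]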
There is one nice attribute of Softmax as compared with standard normalisation.
It reacts to low stimulation (think of a blurry image) of your neural net with a rather uniform distribution, and to high stimulation (i.e. large numbers, think of a crisp image) with probabilities close to 0 and 1.
Standard normalisation, by contrast, does not care, as long as the proportions are the same.
Have a look at what happens when the softmax gets a 10 times larger input, i.e. your neural net got a crisp image and a lot of neurons got strongly activated,
and then compare it with standard normalisation (see the sketch below).
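A small sketch of the contrast (my own code; the numbers are just illustrative logits):

    import numpy as np

    def softmax(z):
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())          # shift by the max for numerical stability
        return e / e.sum()

    def std_norm(z):
        z = np.asarray(z, dtype=float)
        return z / z.sum()               # the "just divide by the sum" proposal

    print(softmax([1, 2]))      # ~[0.269, 0.731]      blurry image: "a cat, perhaps?"
    print(softmax([10, 20]))    # ~[0.00005, 0.99995]  crisp image: "definitely a cat"

    print(std_norm([1, 2]))     # [0.333, 0.667]
    print(std_norm([10, 20]))   # [0.333, 0.667]       identical, the scale is ignored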
I've had this question for months. It seems like we just cleverly guessed the softmax as an output function and then interpreted the input to the softmax as log-probabilities. As you said, why not simply normalize all outputs by dividing by their sum? I found the answer in the Deep Learning book by Goodfellow, Bengio and Courville (2016) in section 6.2.2.
Let's say our last hidden layer gives us z as an activation. Then the softmax is defined as

softmax(z)_i = exp(z_i) / sum_j exp(z_j)
Very Short Explanation
The exp in the softmax function roughly cancels out the log in the cross-entropy loss, causing the loss to be roughly linear in z_i. This leads to a roughly constant gradient when the model is wrong, allowing it to correct itself quickly. Thus, a wrong, saturated softmax does not cause a vanishing gradient.
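To make that concrete, here is the standard gradient of the cross-entropy loss with respect to the logits (a worked line of my own; y is the one-hot target):

    L(z) = -\sum_i y_i \log \operatorname{softmax}(z)_i
    \quad\Rightarrow\quad
    \frac{\partial L}{\partial z_i} = \operatorname{softmax}(z)_i - y_i

Each component of the gradient is bounded between -1 and 1, so it cannot vanish even when the softmax saturates.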
Short Explanation
The most popular method to train a neural network is maximum likelihood estimation. We estimate the parameters theta in a way that maximizes the likelihood of the training data (of size m). Because the likelihood of the whole training dataset is a product of the likelihoods of each sample, it is easier to maximize the log-likelihood of the dataset, and thus the sum of the log-likelihoods of the samples, indexed by k:

argmax_theta sum_{k=1}^m log P(y^(k) | x^(k); theta)
Now, we only focus on the softmax here, with z already given, so we can replace the likelihood of the kth sample by

P(y = i | z) = softmax(z)_i,

with i being the correct class of the kth sample. Now, we see that when we take the logarithm of the softmax to calculate the sample's log-likelihood, we get

log softmax(z)_i = z_i - log sum_j exp(z_j),

which for large differences in z roughly approximates to

z_i - max_j(z_j).
First, we see the linear component z_i here. Secondly, we can examine the behavior of max(z) for two cases:

1. If the model predicts the correct class, then max(z) = z_i and the log-likelihood z_i - max(z) is roughly zero, as it should be for a correct prediction.
2. If the model is wrong, then max(z) = z_j for some other class j with z_j much larger than z_i, so the log-likelihood is roughly z_i - z_j, a large negative number.
We see that the overall log-likelihood will be dominated by the samples where the model is incorrect. Also, even if the model is really incorrect, which leads to a saturated softmax, the loss function does not saturate: it is approximately linear in z_j, meaning that we have a roughly constant gradient. This allows the model to correct itself quickly. Note that this is not the case for the mean squared error, for example.
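A quick numeric check of the approximation and of the non-vanishing gradient (my own sketch):

    import numpy as np

    def log_softmax(z):
        z = np.asarray(z, dtype=float)
        logsumexp = z.max() + np.log(np.exp(z - z.max()).sum())
        return z - logsumexp

    # A badly wrong, saturated prediction: the correct class is 0,
    # but the network puts a huge logit on class 2.
    z = np.array([1.0, 2.0, 15.0])
    print(log_softmax(z)[0])    # about -14.000003
    print(z[0] - z.max())       # -14.0, the approximation above

    # Gradient of the cross-entropy loss wrt the logits: softmax(z) - onehot
    p = np.exp(log_softmax(z))
    print(p - np.array([1.0, 0.0, 0.0]))   # roughly [-1, 0, 1]: bounded, not vanishing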
Long Explanation
If the softmax still seems like an arbitrary choice to you, you can take a look at the justification for using the sigmoid in logistic regression:
Why sigmoid function instead of anything else?
The softmax is the generalization of the sigmoid to multi-class problems and is justified analogously.
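For two classes, the connection can be made explicit (a short derivation of my own):

    \operatorname{softmax}(z)_1
      = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}
      = \frac{1}{1 + e^{-(z_1 - z_2)}}
      = \sigma(z_1 - z_2)

so the two-class softmax is exactly the logistic sigmoid applied to the difference of the logits.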
Suppose we change the softmax function so the output activations are given by

a^L_j = exp(c z^L_j) / sum_k exp(c z^L_k)

where c is a positive constant. Note that c = 1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c → ∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c = 1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).