From Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

S(y_i) = e^(y_i) / Σ_j e^(y_j)

where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.
I've tried the following:
import numpy as np
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
scores = [3.0, 1.0, 0.2]
print(softmax(scores))
which returns:
[ 0.8360188 0.11314284 0.05083836]
But the suggested solution was:
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
return np.exp(x) / np.sum(np.exp(x), axis=0)
which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.
Can someone show mathematically why? Is one correct and the other one wrong?
Are the two implementations similar in terms of code and time complexity? Which is more efficient?
So, this is really a comment on desertnaut's answer, but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample; if your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong: it works for a 1-dimensional input but breaks down for a 2-dimensional one. Let me show this to you.
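Here is a minimal sketch of the three variants being compared; the exact code from the original answer is not reproduced here, so take the names and details as illustrative: your version, one summing over axis=0 in the spirit of desertnaut's answer, and a row-wise version for 2-D batched input.
import numpy as np
def your_softmax(x):
    """Softmax normalised by the sum over ALL elements."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
def desertnaut_softmax(x):
    """Softmax normalised down the columns (axis=0)."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
def softmax(z):
    """Row-wise softmax for a 2-D batch: one distribution per row."""
    assert len(z.shape) == 2
    s = np.max(z, axis=1)[:, np.newaxis]      # per-row max, kept 2-D for broadcasting
    e_z = np.exp(z - s)
    return e_z / e_z.sum(axis=1)[:, np.newaxis]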
Let's take desertnaut's example - a single sample, given as a 2-D array - and look at the output:
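A sketch using the functions above (the exact arrays from the original answer are not shown, so the values are illustrative):
x1 = np.array([1, 2, 3, 6])            # the scores as a plain 1-D vector
x2 = np.array([[1, 2, 3, 6]])          # the same single sample, wrapped as a 2-D (1 x 4) array
print(your_softmax(x1))                # approx. [0.00626879 0.01704033 0.04632042 0.93037047]
print(your_softmax(x2))                # approx. [[0.00626879 0.01704033 0.04632042 0.93037047]]
print(desertnaut_softmax(x2))          # [[1. 1. 1. 1.]] - each one-element column divided by itself
print(softmax(x2))                     # approx. [[0.00626879 0.01704033 0.04632042 0.93037047]]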
You can see that desertnaut's version would fail in this situation. (It would not fail if the input were just one-dimensional, like np.array([1, 2, 3, 6]).)
Let's now use 3 samples, since that is the reason we use a 2-dimensional input in the first place. The following x2 is not the same as the one from desertnaut's example.
This input is a batch of 3 samples, and samples one and three are essentially the same. We now expect 3 rows of softmax activations, where the first should be the same as the third and also the same as our activation of x1 (see the sketch below).
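For illustration, such a batch could look like this (the concrete numbers are only an example; the point is that rows one and three are identical):
x2 = np.array([[1, 2, 3, 6],           # sample 1
               [2, 4, 5, 6],           # sample 2
               [1, 2, 3, 6]])          # sample 3, identical to sample 1
print(your_softmax(x2))                # one softmax over all 12 entries - rows do not sum to 1
print(desertnaut_softmax(x2))          # normalises down the columns - rows do not sum to 1 either
print(softmax(x2))                     # row-wise: rows 1 and 3 match each other and the softmax of x1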
I hope you can see that this is only the case with my solution.
Additionally, here are the results of TensorFlow's own softmax implementation, which agree with the row-wise version:
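A minimal sketch using tf.nn.softmax (written in TensorFlow 2.x eager style; the original answer may have used the older session-based API):
import numpy as np
import tensorflow as tf
x2 = np.array([[1, 2, 3, 6],
               [2, 4, 5, 6],
               [1, 2, 3, 6]], dtype=np.float32)
print(tf.nn.softmax(x2, axis=-1).numpy())   # row-wise softmax: rows 1 and 3 identical, each row sums to 1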
They're both correct, but yours is preferred from the point of view of numerical stability.
You start with

S(x_i) = e^(x_i - max(x)) / Σ_j e^(x_j - max(x))

By using the fact that a^(b - c) = (a^b)/(a^c), we have

S(x_i) = (e^(x_i) / e^(max(x))) / (Σ_j e^(x_j) / e^(max(x))) = e^(x_i) / Σ_j e^(x_j)

which is what the other answer says. You could replace max(x) with any variable and it would cancel out.
(Well... much confusion here, both in the question and in the answers...)
To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered this if you had also tried the 2-D score array provided in the Udacity quiz example.
Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument:
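A sketch of the two functions (the original snippet is not shown here, but by the description they differ only in the sum's axis argument):
import numpy as np
def your_softmax(x):
    """Compute softmax values, normalising by the sum over all elements."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
def softmax(x):
    """Identical, except for the axis=0 argument in the sum."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)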
As I said, for a 1-D score array, the results are indeed identical:
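For instance (the exact 1-D test array is not shown, so these scores are illustrative):
x1 = np.array([1.0, 2.0, 3.0, 6.0])
print(your_softmax(x1))   # approx. [0.00626879 0.01704033 0.04632042 0.93037047]
print(softmax(x1))        # identical result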
Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:
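For example, with a 2-D score array along the lines of the quiz example (one column per sample; the concrete values here are only illustrative):
x2 = np.array([[1, 2, 3, 6],
               [2, 4, 5, 6],
               [3, 8, 7, 6]])
print(your_softmax(x2))             # divides by the sum of ALL entries - columns do not sum to 1
print(softmax(x2))                  # divides by the per-column sums
print(softmax(x2).sum(axis=0))      # approx. [1. 1. 1. 1.]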
The results are different - the second one is indeed identical to the one expected in the Udacity quiz, where all columns sum to 1, which is not the case with the first (wrong) result.
So, all the fuss was actually about an implementation detail - the axis argument. According to the numpy.sum documentation, the default axis=None sums all of the elements of the input array, while here we want to sum row-wise, hence axis=0. For a 1-D array, the sum of the (only) row and the sum of all the elements happen to be identical, hence your identical results in that case...
The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function - see here for the justification (numeric stability, also pointed out by some answers above).
I would like to supplement a little more understanding of the problem. Subtracting the max of the array is correct here. But if you run the code in the other post, you will find it does not give the right answer when the array is 2-D or of higher dimension.
Here I give you a suggestion: do the computation in vectorized form and you will get the correct answer. Since this is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if you don't understand.
From a mathematical point of view, both sides are equal.

And you can easily prove this. Let m = max(x). Now your function softmax returns a vector whose i-th coordinate is equal to

e^(x_i - m) / Σ_j e^(x_j - m) = (e^(x_i) * e^(-m)) / (e^(-m) * Σ_j e^(x_j)) = e^(x_i) / Σ_j e^(x_j)

Notice that this works for any m, because for all (even complex) numbers e^m != 0.
From a computational complexity point of view they are also equivalent: both run in O(n) time, where n is the size of the vector.

From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and it will overflow even for fairly small values of x. Subtracting the maximum value gets rid of this overflow. To experience this in practice, try feeding x = np.array([1000, 5]) into both of your functions: one will return the correct probabilities, the other will overflow with nan (a short demo follows below).

Not related to the question, but your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). In order to fix it you need to use sum(axis=0).
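A quick demo of that overflow (a sketch; the two functions mirror the ones from the question):
import numpy as np
def softmax_stable(x):
    e_x = np.exp(x - np.max(x))       # subtract the max before exponentiating
    return e_x / e_x.sum()
def softmax_naive(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
x = np.array([1000, 5])
print(softmax_stable(x))   # [1. 0.] - e^(5-1000) underflows harmlessly to 0
print(softmax_naive(x))    # [nan 0.] plus an overflow RuntimeWarning, since e^1000 is inf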
EDIT. As of version 1.2.0, scipy includes softmax as a special function:
https://scipy.github.io/devdocs/generated/scipy.special.softmax.html
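For example (a quick sketch with the scores from the question):
from scipy.special import softmax
print(softmax([3.0, 1.0, 0.2]))   # approx. [0.8360188  0.11314284 0.05083836]
# it also takes an axis argument for 2-D inputs, e.g. softmax(x, axis=1) for a row-wise softmax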
I wrote a function applying the softmax over any axis:
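The original code is not reproduced here, but a sketch of such a function, using keepdims so the same code works along any axis, could look like this:
import numpy as np
def softmax(x, axis=-1):
    """Softmax along the chosen axis, subtracting the max for numerical stability."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)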
Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.