I am currently trying to implement a machine learning algorithm that involves the logistic loss function in MATLAB. Unfortunately, I am having some trouble due to numerical overflow.
In general, for a given input s, the value of the logistic loss function is:

log(1 + exp(s))

and the slope of the logistic loss function is:

exp(s)./(1 + exp(s)) = 1./(1 + exp(-s))
In my algorithm, the value of s = X*beta. Here X is a matrix with N data points and P features per data point (i.e. size(X)=[N,P]) and beta is a vector of P coefficients, one per feature, such that size(beta)=[P 1].
I am specifically interested in calculating the average value and gradient of the logistic loss function for a given value of beta.
The average value of the logistic loss function w.r.t. a value of beta is:

L = 1/N * sum(log(1+exp(X*beta)),1)
The average value of the slope of the logistic loss function w.r.t. a value of beta is:

dL = 1/N * sum((exp(X*beta)./(1+exp(X*beta))' X, 1)'

Note that size(dL) = [P 1].
My issue is that these expressions keep producing numerical overflows. The problem effectively comes from the fact that exp(s)=Inf when s>1000 and exp(s)=0 when s<-1000.
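For example, in double precision:

    >> log(1 + exp(1000))            % exp(1000) overflows to Inf
    ans = Inf
    >> exp(1000) ./ (1 + exp(1000))  % Inf/Inf evaluates to NaN
    ans = NaN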
I am looking for a solution such that s can take on any value in floating point arithmetic. Ideally, I would also really appreciate a solution that allows me to evaluate the value and gradient in a vectorized / efficient way.
How about the following approximations:
– For computing L, if s is large, then exp(s) will be much larger than 1:

1 + exp(s) ≈ exp(s)

and consequently

log(1 + exp(s)) ≈ s

If s is small, then using the Taylor series of exp()

exp(s) ≈ 1 + s

and using the Taylor series of log()

log(1 + exp(s)) ≈ log(2 + s) ≈ log(2) + s/2

– For computing dL, for large s

exp(s)./(1 + exp(s)) ≈ 1

and for small s

exp(s)./(1 + exp(s)) ≈ (1 + s)./(2 + s)

– The code to compute L could look for example like this:
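Here is a minimal sketch of that idea (the function name and the cutoff of 33 are my own choices; above that cutoff, log(1 + exp(s)) and s agree to within double-precision resolution):

    function L = logisticLossApprox(s)
    % Element-wise log(1 + exp(s)) using the large-s approximation above.
        L   = zeros(size(s));
        big = (s > 33);
        L(big)  = s(big);                   % log(1 + exp(s)) is essentially s
        L(~big) = log(1 + exp(s(~big)));    % safe to evaluate directly
    end

The average is then sum(logisticLossApprox(X*beta), 1) / N. The slope can be handled the same way: set it to 1 where s is large, and otherwise evaluate 1./(1 + exp(-s)) directly, which never produces an Inf/Inf.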
I found a good article about this problem. Cutting through a lot of words, we can simplify the argument to stating that the original expression

log(1 + exp(s))

can be rewritten as

s + log(1 + exp(-s))
This stops overflow from occurring - it doesn't prevent underflow, but by the time that occurs, you have your answer (namely, s). You can't just use this instead of the original, since it will still give you problems (exp(-s) now overflows when s is very negative). However, we now have the basis for a function that can be written that will be accurate and won't produce over/underflow:
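One way to write such a function is to branch on the sign of s, so the argument passed to exp() is never positive (a sketch; the function name is arbitrary):

    function LL = logistic(s)
    % log(1 + exp(s)) for a scalar s without overflow:
    % for s < 0 the original form is already safe,
    % for s >= 0 use the rewritten form s + log(1 + exp(-s)).
    if s < 0
        LL = log(1 + exp(s));
    else
        LL = s + log(1 + exp(-s));
    end
    end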
I think this maintains reasonably good accuracy.

EDIT: now to the meat of your question - making this vectorized, and allowing the calculation of the slope as well. Let's take these one at a time:
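A vectorized version of the same idea, using logical indexing instead of the if-branch (again a sketch under the same assumptions):

    function LL = logisticVec(s)
    % Vectorized log(1 + exp(s)): the rewritten form is used only where s >= 0,
    % so the argument passed to exp() is never positive.
        LL  = zeros(size(s));
        neg = (s < 0);
        LL(neg)  = log(1 + exp(s(neg)));
        LL(~neg) = s(~neg) + log(1 + exp(-s(~neg)));
    end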
To obtain the average you wanted:
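For example, with the logisticVec sketch above:

    L = sum(logisticVec(X*beta), 1) / N;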
The slope is a little bit trickier; note I believe you may have a typo in your expression (missing a multiplication sign).
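With the multiplication restored, the expression you are presumably after is something like (written here as a matrix product so the result is P-by-1):

    dL = X' * (exp(X*beta) ./ (1 + exp(X*beta))) / N;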
If we divide top and bottom by exp(X*beta) we get
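For example, in the same matrix-product form:

    dL = X' * (1 ./ (1 + exp(-X*beta))) / N;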
Once again, the overflow has gone away and we are left with underflow - but since the underflowed value has 1 added to it, the error this creates is insignificant.