I am currently trying to implement a machine learning algorithm that involves the logistic loss function in MATLAB. Unfortunately, I am having some trouble due to numerical overflow.
In general, for a given an input s
, the value of the logistic function is:
log(1 + exp(s))
and the slope of the logistic loss function is:
exp(s)./(1 + exp(s)) = 1./(1 + exp(-s))
In my algorithm, the value of s = X*beta
. Here X
is a matrix with N
data points and P
features per data point (i.e. size(X)=[N,P]
) and beta
is a vector of P
coefficients for each feature such that size(beta)=[P 1]
.
I am specifically interested in calculating the average value and gradient of the Logistic function for given value of beta
.
The average value of the Logistic function w.r.t to a value of beta
is:
L = 1/N * sum(log(1+exp(X*beta)),1)
The average value of the slope of the Logistic function w.r.t. to a value of b
is:
dL = 1/N * sum((exp(X*beta)./(1+exp(X*beta))' X, 1)'
Note that size(dL) = [P 1].
My issue is that these expressions keep producing numerical overflows. The problem effectively comes from the fact that exp(s)=Inf
when s>1000
and exp(s)=0
when s<-1000.
I am looking for a solution such that s
can take on any value in floating point arithmetic. Ideally, I would also really appreciate a solution that allows me to evaluate the value and gradient in a vectorized / efficient way.
How about the following approximations:
– For computing L
, if s
is large, then exp(s)
will be much larger than 1:
1 + exp(s) ≅ exp(s)
and consequently
log(1 + exp(s)) ≅ log(exp(s)) = s.
If s
is small, then using the Taylor series of exp()
exp(s) ≅ 1 + s
and using the Taylor series of log()
log(1 + exp(s)) ≅ log(2 + s) ≅ log(2) + s / 2.
– For computing dL
, for large s
exp(s) ./ (1 + exp(s)) ≅ 1
and for small s
exp(s) ./ (1 + exp(s)) ≅ 1/2 + s / 4.
– The code to compute L
could look for example like this:
s = X*beta;
l = log(1+exp(s));
ind = isinf(l);
l(ind) = s(ind);
ind = (l == 0);
l(ind) = log(2) + s(ind) / 2;
L = 1/N * sum(l,1)
I found a good article about this problem.
Cutting through a lot of words, we can simplify the argument to stating that the original expression
log(1 + exp(s))
can be rewritten as
log(exp(s)*(exp(-s) + 1))
= log(exp(s)) + log(exp(-s) + 1)
= s + log(exp(-s) + 1)
This stops overflow from occurring - it doesn't prevent underflow, but by the time that occurs, you have your answer (namely, s
). You can't just use this instead of the original, since it will still give you problems. However, we now have the basis for a function that can be written that will be accurate and won't produce over/underflow:
function LL = logistic(s)
if s<0
LL = log(1 + exp(s));
else
LL = s + logistic(-s);
I think this maintains reasonably good accuracy.
EDIT now to the meat of your question - making this vectorized, and allowing the calculation of the slope as well. Let's take these one at a time:
function LL = logisticVec(s)
LL = zeros(size(s));
LL(s<0) = log(1 + exp(s(s<0)));
LL(s>=0) = s(s>=0) + log(1 + exp(-s(s>=0)));
To obtain the average you wanted:
L = logisticVec(X*beta) / N;
The slope is a little bit trickier; note I believe you may have a typo in your expression (missing a multiplication sign).
dL/dbeta = sum(X * exp(X*beta) ./ (1 + exp(X*beta))) / N;
If we divide top and bottom by exp(X*beta)
we get
dL = sum(X ./ (exp(-X*beta) + 1)) / N;
Once again, the overflow has gone away and we are left with underflow - but since the underflowed value has 1
added to it, the error this creates is insignificant.