可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm trying to implement gradient calculation for neural networks using backpropagation. I cannot get it to work with cross entropy error and rectified linear unit (ReLU) as activation.

I managed to get my implementation working for squared error with sigmoid, tanh and ReLU activation functions. Cross entropy (CE) error with sigmoid activation gradient is computed correctly. However, when I change activation to ReLU - it fails. (I'm skipping tanh for CE as it retuls values in (-1,1) range.)

Is it because of the behavior of log function at values close to 0 (which is returned by ReLUs approx. 50% of the time for normalized inputs)? I tried to mitiage that problem with:

log(max(y,eps))

but it only helped to bring error and gradients back to real numbers - they are still different from numerical gradient.

I verify the results using numerical gradient:

num_grad = (f(W+epsilon) - f(W-epsilon)) / (2*epsilon)

The following matlab code presents a simplified and condensed backpropagation implementation used in my experiments:

function [f, df] = backprop(W, X, Y)
% W - weights
% X - input values
% Y - target values

act_type='relu';    % possible values: sigmoid / tanh / relu
error_type = 'CE';  % possible values: SE / CE

N=size(X,1); n_inp=size(X,2); n_hid=100; n_out=size(Y,2);
w1=reshape(W(1:n_hid*(n_inp+1)),n_hid,n_inp+1);
w2=reshape(W(n_hid*(n_inp+1)+1:end),n_out, n_hid+1);

% feedforward
X=[X ones(N,1)];
z2=X*w1'; a2=act(z2,act_type); a2=[a2 ones(N,1)];
z3=a2*w2'; y=act(z3,act_type);

if strcmp(error_type, 'CE')   % cross entropy error - logistic cost function
    f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
else % squared error
    f=0.5*sum(sum((y-Y).^2));
end

% backprop
if strcmp(error_type, 'CE')   % cross entropy error
    d3=y-Y;
else % squared error
    d3=(y-Y).*dact(z3,act_type);
end

df2=d3'*a2;
d2=d3*w2(:,1:end-1).*dact(z2,act_type);
df1=d2'*X;

df=[df1(:);df2(:)];

end

function f=act(z,type) % activation function
switch type
    case 'sigmoid'
        f=1./(1+exp(-z));
    case 'tanh'
        f=tanh(z);
    case 'relu'
        f=max(0,z);
end
end

function df=dact(z,type) % derivative of activation function
switch type
    case 'sigmoid'
        df=act(z,type).*(1-act(z,type));
    case 'tanh'
        df=1-act(z,type).^2;
    case 'relu'
        df=double(z>0);
end
end

Edit

After another round of experiments, I found out that using a softmax for the last layer:

y=bsxfun(@rdivide, exp(z3), sum(exp(z3),2));

and softmax cost function:

f=-sum(sum(Y.*log(y)));

make the implementaion working for all activation functions including ReLU.

This leads me to conclusion that it is the logistic cost function (binary clasifier) that does not work with ReLU:

f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));

However, I still cannot figure out where the problem lies.

回答1:

Every squashing function sigmoid, tanh and softmax (in the output layer) means different cost functions. Then makes sense that a RLU (in the output layer) does not match with the cross entropy cost function. I will try a simple square error cost function to test a RLU output layer.

The true power of RLU is in the hidden layers of a deep net since it not suffer from gradient vanishing error.

回答2:

If you use gradient descendent you need to derive the activation function to be used later in the back-propagation approach. Are you sure about the 'df=double(z>0)'?. For the logistic and tanh seems to be right.

Further, are you sure about this 'd3=y-Y' ? I would say this is true when you use the logistic function but not for the ReLu (the derivative is not the same and therefore will not lead to that simple equation).

You could use the softplus function that is a smooth version of the ReLU, which the derivative is well known (logistic function).

回答3:

I think the flaw lies in comapring with the numerically computed derivatives. In your derivativeActivation function , you define the derivative of ReLu at 0 to be 0. Where as numerically computing the derivative at x=0 shows it to be (ReLU(x+epsilon)-ReLU(x-epsilon)/(2*epsilon)) at x =0 which is 0.5. Therefore, defining the derivative of ReLU at x=0 to be 0.5 will solve the problem

回答4:

I thought I'd share my experience I had with similar problem. I too have designed my multi classifier ANN in a way that all hidden layers use RELU as non-linear activation function and the output layer uses softmax function.

My problem was related to some degree to numerical precision of the programming language/platform I was using. In my case I noticed that if I used "plain" RELU not only does it kill the gradient but the programming language I used produced the following softmax output vectors (this is just a example sample):

⎡1.5068230536681645e-35⎤
⎢ 2.520367499064734e-18⎥
⎢3.2572859518007807e-22⎥
⎢                     1⎥
⎢ 5.020155103452967e-32⎥
⎢1.7620297760773188e-18⎥
⎢ 5.216008990667109e-18⎥
⎢ 1.320937038894421e-20⎥
⎢2.7854159049317976e-17⎥
⎣1.8091246170996508e-35⎦

Notice the values of most of the elements are close to 0, but most importantly notice the 1 value in the output.

I used a different cross-entropy error function than the one you used. Instead of calculating log(max(1-y, eps)) I stuck to the basic log(1-y). So given the output vector above, when I calculated log(1-y) I got the -Inf as a result of cross-entropy, which obviously killed the algorithm.

I imagine if your eps is not reasonably high enough so that log(max(1-y, eps)) -> log(max(0, eps)) doesn't yield way too small log output you might be in a similar pickle like myself.

My solution to this problem was to use Leaky RELU. Once I've started using it, I could carry on using the multi classifier cross-entropy as oppose to softmax-cost function you decided to try.