Neural Activation Functions - Difference between L

Posted 2019-03-15 10:55

Question:

I'm writing some basic neural network methods - specifically the activation functions - and have hit the limits of my rubbish knowledge of math. I understand the respective ranges (-1/1) (0/1) etc, but the varying descriptions and implementations have me confused.

Specifically sigmoid, logistic, bipolar sigmoid, tanh, etc.

Does sigmoid simply describe the shape of the function irrespective of range? If so, then is tanh a 'sigmoid function'?

I have seen 'bipolar sigmoid' compared against 'tanh' in a paper, however I have seen both functions implemented (in various libraries) with the same code:

(( 2/ (1 + Exp(-2 * n))) - 1). Are they exactly the same thing?

Likewise, I have seen logistic and sigmoid activations implemented with the same code:

( 1/ (1 + Exp(-1 * n))). Are these also equivalent?

Lastly, does it even matter that much in practice? I see on wiki a plot of very similar sigmoid functions - could any of these be used? Some look like they may be considerably faster to compute than others.

Answer 1:

Logistic function: e^x/(e^x + e^c)

Special ("standard") case of the logistic function: 1/(1 + e^(-x))

Bipolar sigmoid: never heard of it.

Tanh: (e^x - e^(-x))/(e^x + e^(-x))

Sigmoid usually refers to the shape (and limits), so yes, tanh is a sigmoid function. But in some contexts it refers specifically to the standard logistic function, so you have to be careful. And yes, you could use any sigmoid function and probably do just fine.

(( 2/ (1 + Exp(-2 * x))) - 1) is equivalent to tanh(x).
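The equivalence follows from a little algebra: 2/(1 + e^(-2x)) - 1 = (1 - e^(-2x))/(1 + e^(-2x)), and multiplying the numerator and denominator by e^x gives (e^x - e^(-x))/(e^x + e^(-x)), which is tanh(x). As a quick numerical sanity check, here is a minimal sketch; the function names `logistic` and `scaled_logistic` are made up for illustration and do not come from any particular library:

```python
import math

def logistic(x):
    """Standard logistic function 1 / (1 + e^-x), range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def scaled_logistic(x):
    """The expression from the question: 2 / (1 + e^(-2x)) - 1, range (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

# Compare against the standard-library tanh at a few sample points.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(scaled_logistic(x), math.tanh(x), abs_tol=1e-12)
    # Same identity written in terms of the logistic: tanh(x) = 2 * logistic(2x) - 1
    assert math.isclose(2.0 * logistic(2.0 * x) - 1.0, math.tanh(x), abs_tol=1e-12)
```

So tanh is just the standard logistic stretched from (0, 1) to (-1, 1) with its input doubled.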



Answer 2:

Generally, the most important differences are:

  a. smooth, continuously differentiable (like tanh and logistic) vs. step or truncated
  b. competitive vs. transfer
  c. sigmoid vs. radial
  d. symmetric (-1,+1) vs. asymmetric (0,1)

Generally, differentiability is required for hidden layers, and tanh is often recommended as being more balanced. For tanh, an output of 0 sits at the steepest point of the curve (highest gradient or gain) and is not a trap, while for the logistic an output of 0 is the lower limit of the range and a trap for anything pushing deeper into negative territory. Radial (basis) functions are about distance from a typical prototype and are good for convex, circular regions around a neuron, while sigmoid functions are about separating linearly and are good for half-spaces. Many sigmoid units are needed to approximate a convex region well, with circular/spherical regions being the worst case for sigmoids and the best for radials.

Generally, the recommendation is for tanh on the intermediate layers for +/- balance, and to suit the output layer to the task (boolean/dichotomous class decision with threshold, logistic or competitive outputs (e.g. softmax, a self-normalizing multiclass generalization of logistic); regression tasks can even be linear). The output layer doesn't need to be continuously differentiable. The input layer should be normalized in some way, either to [0,1] or, better still, by standardization or normalization with demeaning to [-1,+1]. If you include a dummy input of 1 and then normalize so that ||x||_p = 1, you are dividing by a sum or length, and this magnitude information is retained in the dummy bias input rather than being lost. If you normalize over examples, this technically interferes with your test data if you look at them, or they may be out of range if you don't. But with ||x||_2 normalization such variations or errors should approach the normal distribution if they are effects of natural distribution or error. This means that with high probability they won't exceed the original range (probably around 2 standard deviations) by more than a small factor (viz. such overrange values are regarded as outliers and not significant).

So I recommend unbiased instance normalization or biased pattern standardization or both on the input layer (possibly with data reduction with SVD), tanh on the hidden layers, and a threshold function, logistic function or competitive function on the output for classification, but linear with unnormalized targets or perhaps logsig with normalized targets for regression.
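One possible reading of this recipe, for a binary classifier, is sketched below: standardize each input feature using statistics from the training rows only, pass the result through a tanh hidden layer, and finish with a single logistic output. This is an illustration under assumptions, not the answerer's code. Per-feature standardization is only one of the normalizations mentioned above, and the helper names, data, and weights are all invented for the example (nothing here is trained).

```python
import math

def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation, computed from training rows only."""
    n = len(train_rows)
    dims = len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((r[j] - means[j]) ** 2 for r in train_rows) / n
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for constant features
    return means, stds

def standardize(row, means, stds):
    """Demean and rescale one input row with the stored training statistics."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(row, hidden_w, hidden_b, out_w, out_b):
    """tanh hidden layer followed by a single logistic output unit."""
    hidden = [math.tanh(sum(w * x for w, x in zip(ws, row)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    return logistic(sum(w * h for w, h in zip(out_w, hidden)) + out_b)

# Hypothetical data and weights, chosen only to make the sketch runnable.
train = [[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]]
means, stds = fit_standardizer(train)
x = standardize([2.5, 210.0], means, stds)  # a new row, scaled with training stats
p = forward(x, hidden_w=[[0.5, -0.3], [0.1, 0.8]], hidden_b=[0.0, 0.1],
            out_w=[1.2, -0.7], out_b=0.05)
print(p)  # a value strictly between 0 and 1
```

Reusing the training means and standard deviations for new rows avoids peeking at test data, at the cost that unseen rows can fall somewhat outside the training range.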



Answer 3:

The word is (and I've tested) that in some cases it might be better to use tanh than the logistic, since:

  1. Outputs near y = 0 on the logistic, multiplied by a weight w, yield a value near 0, which doesn't have much effect on the upper layers they feed (although absence also has an effect). A value near y = -1 on tanh, multiplied by a weight w, might yield a large number, which has more numeric effect.
  2. Around z = 0, the derivative of tanh (1 - y^2) yields values greater than that of the logistic (y(1 - y) = y - y^2). For example, when z = 0, the logistic function yields y = 0.5 and y' = 0.25, while tanh yields y = 0 but y' = 1 (you can see this just by looking at the graph). This means that a tanh layer might learn faster than a logistic layer because of the magnitude of the gradient.
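The gradient comparison in point 2 is easy to check numerically. The sketch below (function names are illustrative) evaluates both derivatives from the activation outputs, using y' = y(1 - y) for the logistic and y' = 1 - y^2 for tanh:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    y = logistic(z)
    return y * (1.0 - y)   # y' = y (1 - y)

def tanh_grad(z):
    y = math.tanh(z)
    return 1.0 - y * y     # y' = 1 - y^2

# Print both gradients side by side at a few pre-activation values.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  logistic'={logistic_grad(z):.4f}  tanh'={tanh_grad(z):.4f}")
```

At z = 0 this gives 0.25 for the logistic and 1 for tanh, matching the figures above. The advantage is local, though: tanh saturates faster, so by |z| = 2 the logistic derivative is actually the larger of the two.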


Answer 4:

Bipolar sigmoid = (1-e^(-x))/(1 + e^(-x))

A detailed explanation can be found here.
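Taking the bipolar sigmoid definition given in this answer at face value, multiplying its numerator and denominator by e^(x/2) shows that it equals tanh(x/2), and it also equals 2 * logistic(x) - 1. That would explain the library code mentioned in the question: the version with Exp(-2 * n) is tanh(n) exactly, i.e. this bipolar sigmoid evaluated at 2n. A minimal check, with illustrative function names:

```python
import math

def bipolar_sigmoid(x):
    """(1 - e^-x) / (1 + e^-x), range (-1, 1)."""
    e = math.exp(-x)
    return (1.0 - e) / (1.0 + e)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    # Bipolar sigmoid is tanh with its input halved ...
    assert math.isclose(bipolar_sigmoid(x), math.tanh(x / 2.0), abs_tol=1e-12)
    # ... and the logistic rescaled from (0, 1) to (-1, 1).
    assert math.isclose(bipolar_sigmoid(x), 2.0 * logistic(x) - 1.0, abs_tol=1e-12)
```

Under this definition, bipolar sigmoid and tanh have the same shape and range and differ only in how the input is scaled, which a network can absorb into its weights.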