In an LSTM network (Understanding LSTMs), why do the gates use tanh? What is the intuition behind this? Is it just a nonlinear transformation? If so, can I change it to another activation function (e.g. ReLU)?
Sigmoid specifically is used as the gating function for the three gates (input, output, forget) in the LSTM: since it outputs a value between 0 and 1, it can allow either no flow or complete flow of information through the gates. To mitigate the vanishing gradient problem, on the other hand, we need a function whose second derivative sustains over a long range before going to zero, and tanh is a good function with that property.

A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization), and easy to compute. Given these qualities, I believe you can use ReLU in place of tanh, since the two are good alternatives for each other. But before choosing an activation function, you should know the advantages and disadvantages of your choice over the others. Below is a short description of some activation functions and their advantages.

Sigmoid
Mathematical expression:
sigmoid(z) = 1 / (1 + exp(-z))
1st order derivative:
sigmoid'(z) = exp(-z) / (1 + exp(-z))^2
Advantages:
- Bounded output in (0, 1), so it can be read as a probability or a gate.
- Smooth, monotonic, and differentiable everywhere; the derivative is cheap to compute since sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)).
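As a quick sanity check on the formulas above, here is a minimal sketch computing sigmoid and its derivative, and confirming that the closed-form derivative matches the sigmoid(z)(1 - sigmoid(z)) identity:

```python
import math

def sigmoid(z):
    # Logistic function: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative: exp(-z) / (1 + exp(-z))**2
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

# The derivative equals sigmoid(z) * (1 - sigmoid(z))
z = 0.5
assert abs(sigmoid_prime(z) - sigmoid(z) * (1 - sigmoid(z))) < 1e-12
```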
Tanh
Mathematical expression:
tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]
1st order derivative:
tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)
Advantages:
- Bounded output in (-1, 1) and zero-centered, which tends to make optimization easier than with sigmoid.
- Stronger gradients near zero than sigmoid (derivative up to 1, versus sigmoid's maximum of 0.25).
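The identity tanh'(z) = 1 - tanh²(z) can be verified against a finite-difference estimate:

```python
import math

def tanh_prime(z):
    # Closed-form derivative of tanh
    return 1.0 - math.tanh(z) ** 2

# Compare with a central finite-difference approximation
z, h = 0.7, 1e-6
numeric = (math.tanh(z + h) - math.tanh(z - h)) / (2 * h)
assert abs(tanh_prime(z) - numeric) < 1e-8
```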
Hard Tanh
Mathematical expression:
hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1
1st order derivative:
hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise
Advantages:
- Cheaper to compute than tanh (no exponentials).
- Saturates at exactly -1 and 1 rather than approaching them asymptotically.
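The piecewise definition above amounts to clipping the input to [-1, 1]; a minimal sketch:

```python
def hardtanh(z):
    # Piecewise-linear approximation of tanh: clip to [-1, 1]
    return max(-1.0, min(1.0, z))

def hardtanh_prime(z):
    # Gradient is 1 inside the linear region, 0 in the saturated regions
    return 1.0 if -1.0 <= z <= 1.0 else 0.0
```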
ReLU
Mathematical expression:
relu(z) = max(z, 0)
1st order derivative:
relu'(z) = 1 if z > 0; 0 otherwise
Advantages:
- Does not saturate for positive inputs, so gradients do not vanish there.
- Very cheap to compute, and induces sparse activations (negative inputs map to exactly zero).
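ReLU and its (sub)gradient are one-liners:

```python
def relu(z):
    # Passes positive inputs through unchanged, zeros out the rest
    return max(z, 0.0)

def relu_prime(z):
    # Subgradient: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0
```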
Leaky ReLU
Mathematical expression:
leaky(z) = max(z, k*z) where 0 < k < 1
1st order derivative:
leaky'(z) = 1 if z > 0; k otherwise
Advantages:
- Keeps ReLU's benefits while passing a small gradient k for negative inputs, which avoids "dead" units that never activate.
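A short sketch of leaky ReLU, with k = 0.01 as an illustrative choice of slope:

```python
def leaky_relu(z, k=0.01):
    # k is the small slope applied to negative inputs, 0 < k < 1
    return max(z, k * z)

def leaky_relu_prime(z, k=0.01):
    # Unlike plain ReLU, the gradient never goes to zero
    return 1.0 if z > 0 else k
```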
This paper describes some fun activation functions; you may consider reading it.
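To make the gating described above concrete, here is a minimal single-timestep LSTM cell sketch in NumPy. The weight names, dimensions, and random initialization are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3  # illustrative input and hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on the concatenation [h_prev, x]
W_f, W_i, W_o, W_c = (rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(n_hid)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # forget gate in (0, 1): 0 = drop, 1 = keep
    i = sigmoid(W_i @ z + b_i)        # input gate in (0, 1)
    o = sigmoid(W_o @ z + b_o)        # output gate in (0, 1)
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate values in (-1, 1): can raise or lower the state
    c = f * c_prev + i * c_tilde      # new cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c

h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid))
```

Note how the three sigmoid outputs only ever scale values (multiplication by a number in (0, 1)), while tanh produces the signed candidate values that actually get added to the cell state.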
LSTMs manage an internal state vector whose values should be able to increase or decrease when we add the output of some function. Sigmoid output is always non-negative; values in the state would only increase. The output from tanh can be positive or negative, allowing for increases and decreases in the state.
That's why tanh is used to determine candidate values to get added to the internal state. The GRU cousin of the LSTM doesn't have a second tanh, so in a sense the second one is not necessary. Check out the diagrams and explanations in Chris Olah's Understanding LSTM Networks for more.
The related question, "Why are sigmoids used in LSTMs where they are?" is also answered based on the possible outputs of the function: "gating" is achieved by multiplying by a number between zero and one, and that's what sigmoids output.
There aren't really meaningful differences between the derivatives of sigmoid and tanh; tanh is just a rescaled and shifted sigmoid: see Richard Socher's Neural Tips and Tricks. If second derivatives are relevant, I'd like to know how.
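The "rescaled and shifted" relationship is exactly tanh(z) = 2·sigmoid(2z) - 1, which is easy to check numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# tanh is a sigmoid stretched from (0, 1) to (-1, 1) and compressed along z
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(math.tanh(z) - (2 * sigmoid(2 * z) - 1)) < 1e-12
```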