I'm trying to use the dynamic_rnn function in Tensorflow to speed up training. After doing some reading, my understanding is that one way to speed up training is to explicitly pass a value to the sequence_length parameter in this function. After a bit more reading, and finding this SO explanation, it seems like what I need to pass is a vector (maybe defined by a tf.placeholder) that contains the length of each sequence within a batch.
Here's where I'm confused: in order to take advantage of this, should I pad each of my batches to the longest-length sequence within the batch instead of the longest-length sequence in the training set? How does Tensorflow handle the remaining zeros/pad-tokens in any of the shorter sequences? Also, is the main advantage here really speed, or just extra assurance that we're masking pad-tokens during training? Any help/context would be appreciated.
The sequences within a batch must be aligned, i.e., they must all have the same length. So the general answer to your question is "yes". But different batches don't have to be of the same length, so you can stratify your input sequences into groups of roughly the same size and pad each group accordingly. This technique is called bucketing, and you can read about it in this tutorial; a rough sketch follows below.
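For concreteness, here's a minimal sketch of the bucketing idea in plain NumPy (the function name, batching scheme, and pad value are my own illustrations, not part of any TensorFlow utility):

```python
import numpy as np

def bucketed_batches(sequences, batch_size, pad_value=0):
    """Sort sequences by length, then pad each batch only to the
    longest sequence in that batch, not in the whole training set."""
    ordered = sorted(sequences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        bucket = ordered[i:i + batch_size]
        max_len = max(len(s) for s in bucket)
        padded = np.full((len(bucket), max_len), pad_value, dtype=np.int32)
        lengths = np.zeros(len(bucket), dtype=np.int32)
        for j, seq in enumerate(bucket):
            padded[j, :len(seq)] = seq
            lengths[j] = len(seq)
        # lengths is exactly the vector you'd feed to sequence_length
        batches.append((padded, lengths))
    return batches
```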
As for how Tensorflow handles the padding: it's pretty much intuitive.
tf.nn.dynamic_rnn returns two tensors: output and states. Suppose the actual sequence length is t and the padded sequence length is T. Then output will contain zero vectors for all steps i > t, and states will contain the t-th cell state, ignoring the states of the trailing padded steps. Here's an example:
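(A minimal sketch using the TF 1.x API; the exact numbers in outputs_val depend on the random weight initialization.)

```python
import numpy as np
import tensorflow as tf

n_steps = 2    # padded length T
n_inputs = 3
n_neurons = 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
seq_length = tf.placeholder(tf.int32, [None])

cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

X_batch = np.array([
    # t = 0      t = 1
    [[0, 1, 2], [9, 8, 7]],  # instance 0, actual length 2
    [[3, 4, 5], [0, 0, 0]],  # instance 1, actual length 1 (padded)
])
seq_length_batch = np.array([2, 1])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    outputs_val, states_val = sess.run(
        [outputs, states],
        feed_dict={X: X_batch, seq_length: seq_length_batch})
    print(outputs_val)   # shape (2, 2, 5)
    print(states_val)    # shape (2, 5)
```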
Note that instance 1 is padded, so outputs_val[1,1] is a zero vector and states_val[1] == outputs_val[1,0].
Of course, batch processing is more efficient than feeding the sequences one by one. But the main advantage of specifying the length is not speed; it's that you get a meaningful final state out of the RNN, i.e., the padded items don't affect the result tensor. You will get exactly the same result (and the same speed) if you don't set the length but select the right states manually.
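For completeness, here's one way to do that manual selection (a sketch; last_relevant_output is a hypothetical helper of mine, assuming outputs has shape [batch_size, T, n_neurons] and lengths holds the true lengths):

```python
import tensorflow as tf

def last_relevant_output(outputs, lengths):
    """Gather the output at the last non-padded step of each sequence.

    outputs: [batch_size, T, n_neurons] tensor from dynamic_rnn
    lengths: [batch_size] int32 tensor of actual sequence lengths
    """
    batch_size = tf.shape(outputs)[0]
    # Index (i, lengths[i] - 1) selects instance i's last real step;
    # since an RNN is causal, this matches what sequence_length gives.
    indices = tf.stack([tf.range(batch_size), lengths - 1], axis=1)
    return tf.gather_nd(outputs, indices)  # [batch_size, n_neurons]
```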