How to handle gradients when training two sub-graphs

Posted 2019-06-03 08:25

Question:

The general idea I am trying to realize is a seq2seq model (taken from the translate.py example in the TensorFlow models repository, based on the seq2seq class). This trains well.

Furthermore, I am using the hidden state of the RNN after all the encoding is done, right before decoding starts (I call it the "hidden state at end of encoding"). I feed this hidden state into a further sub-graph which I call "prices" (see below). The training gradients of this sub-graph backpropagate not only through the additional sub-graph, but also back into the encoder part of the RNN (which is what I want and need).

The plan is to attach more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.

Now, during training, when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I execute the training in the following way (pseudo-code, using the op names defined further below; feed stands for the appropriate feed dict):

if global_step % 10 == 0:
    # run only the prices training op (self.training_op_price, defined below)
    session.run(self.training_op_price, feed_dict=feed)
else:
    # run only the encoder+decoder training op
    session.run(self.updates[bucket_id], feed_dict=feed)

So I am not training both sub-graphs simultaneously. Now it does converge, but the encoder+decoder part converges MUCH more slowly than if I ONLY train that part and never train the prices sub-graph.

My question is: I should be able to train both sub-graphs simultaneously. But probably I have to rescale the gradients flowing back into the hidden state at end of encoding, where we get gradients from the prices sub-graph AND from the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.
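
To illustrate: conceptually, the two gradient signals meet at the shared hidden state like this (sketch only, using the tensors from the code below; alpha is a made-up scaling factor, not something from my code):

# Each sub-graph sends its own gradient into the shared hidden state;
# TensorFlow sums these contributions when backpropagating into the encoder.
g_decoder = tf.gradients(self.losses[bucket_id], self.hidden_state)[0]
g_prices  = tf.gradients(self.loss_price_scalar, self.hidden_state)[0]
alpha = 0.1  # made-up rescaling factor for the prices gradient
g_total = g_decoder + alpha * g_prices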

Here is the training part of the code:

This is the (almost original) training-op preparation:

if not forward_only:
  self.gradient_norms = []
  self.updates = []
  opt = tf.train.AdadeltaOptimizer(self.learning_rate)

  for bucket_id in xrange(len(buckets)):
    tf.scalar_summary("seq2seq loss", self.losses[bucket_id])

    # Gradients of the seq2seq loss w.r.t. the encoder/decoder variables only
    gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
    # Clip by global norm to keep RNN training stable
    clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
    self.gradient_norms.append(norm)
    self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq),
                                            global_step=self.global_step))

Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:

  with tf.name_scope('prices') as scope:
    #First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
    #self.activation_price_first_layer = tf.nn.relu(self.output_price_first_layer)

    #Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price) # (not used in this snippet)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")

    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price),B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
    self.label_price      = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    #Remember the prices trainables
    var_list_prices       = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all          = tf.trainable_variables()

    #Backprop: train on the scalar mean loss
    self.loss_price        = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price   = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    # Minimize over ALL trainable variables so the gradients also flow back into the encoder
    self.training_op_price = self.optimizer_price.minimize(self.loss_price_scalar, var_list=var_list_all)

Thanks a bunch

Answer 1:

I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.

Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single, standard tf.train.Optimizer subclass. TensorFlow will then derive the appropriate backprop computation for your split network.
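
A minimal sketch of that approach, reusing the loss tensors from your code for one bucket_id (the names total_loss and train_op are mine):

# Register both scalar losses in the standard losses collection
tf.add_to_collection(tf.GraphKeys.LOSSES, self.losses[bucket_id])
tf.add_to_collection(tf.GraphKeys.LOSSES, self.loss_price_scalar)

# Single combined loss: an unweighted sum of the collection's contents
total_loss = tf.contrib.losses.get_total_loss()

# One optimizer, one train op: gradients flow into the decoder, the prices
# layers, and the shared encoder in a single consistent update
opt = tf.train.AdadeltaOptimizer(self.learning_rate)
train_op = opt.minimize(total_loss, global_step=self.global_step)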

The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how (or whether) you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
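
For example, a simple weighted sum (the weight is hypothetical; you would have to tune it):

# Down-weight the prices loss relative to the seq2seq loss
price_loss_weight = 0.1  # hypothetical value, tune on validation data
total_loss = (self.losses[bucket_id]
              + price_loss_weight * self.loss_price_scalar)
train_op = tf.train.AdadeltaOptimizer(self.learning_rate).minimize(
    total_loss, global_step=self.global_step)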