I'm looking at a TensorFlow network implementing reinforcement learning for the OpenAI Gym CartPole environment.
The network implements the likelihood-ratio approach for a policy-gradient agent.
The thing is that the policy loss is defined using the gather_nd op! Here, look:
....
self.y = tf.nn.softmax(tf.matmul(self.W3, self.h2) + self.b3, dim=0)  # action probabilities (softmax over rows)
self.curr_reward = tf.placeholder(shape=[None], dtype=tf.float32)     # reward/return fed in for each sample
self.actions_array = tf.placeholder(shape=[None, 2], dtype=tf.int32)  # one [row, column] index pair per sample
self.pai_array = tf.gather_nd(self.y, self.actions_array)             # probability of the action that was taken
self.L = -tf.reduce_mean(tf.log(self.pai_array) * self.curr_reward)   # likelihood-ratio policy-gradient loss
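To make concrete what that gather_nd is selecting, here is a tiny standalone toy I wrote (TF 1.x, same API family as the snippet; the [num_actions, batch] layout is only my guess from the dim=0 softmax, not something stated in the original code):

import tensorflow as tf

# made-up stand-in for self.y: rows = actions, columns = samples
y = tf.constant([[0.7, 0.2, 0.9],
                 [0.3, 0.8, 0.1]])
# one [row, column] pair per sample: which entry of y to pick for each column
idx = tf.constant([[0, 0], [1, 1], [0, 2]])
picked = tf.gather_nd(y, idx)

with tf.Session() as sess:
    print(sess.run(picked))  # [0.7 0.8 0.9]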
And then they take the derivative of this loss with respect to all the parameters of the network:
self.gradients = tf.gradients(self.L,tf.trainable_variables())
How can this be? I thought that the whole point in neural networks is to always work with differentiable ops, like cross-entropy, and never to do something strange like selecting indexes of self.y according to some self.actions_array whose entries were sampled at random and are clearly not differentiable.
(I've put a small standalone toy that reproduces the same pattern at the bottom of this post.)
What am I missing here? Thanks!
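For reference, here is the minimal standalone toy I mentioned (TF 1.x, made-up weights and shapes, not the original code) that reproduces the same pattern end to end; it runs and produces gradients, which is exactly what I don't understand:

import tensorflow as tf

# made-up parameters and shapes, just to reproduce the pattern above
W = tf.Variable([[0.5, -0.2, 0.1],
                 [0.3,  0.4, -0.6]])
y = tf.nn.softmax(W, dim=0)                         # same dim=0 softmax as in the question
actions = tf.placeholder(shape=[None, 2], dtype=tf.int32)
reward = tf.placeholder(shape=[None], dtype=tf.float32)
picked = tf.gather_nd(y, actions)
L = -tf.reduce_mean(tf.log(picked) * reward)
grads = tf.gradients(L, tf.trainable_variables())   # runs without complaint

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    g = sess.run(grads, feed_dict={actions: [[0, 0], [1, 1], [0, 2]],
                                   reward: [1.0, -1.0, 0.5]})
    print(g)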