I was looking at the example code for processing gradients that TensorFlow has:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
However, I noticed that the apply_gradients function was derived from the GradientDescentOptimizer. Does that mean that, using the example code above, one can only implement gradient-descent-like rules (note that we could change opt = GradientDescentOptimizer to Adam or any of the other optimizers)? In particular, what does apply_gradients do? I did check the code on the TF GitHub page, but it was a bunch of Python that had nothing to do with mathematical expressions, so it was hard to tell what it was doing and how it changed from optimizer to optimizer.
For example, if I wanted to implement my own custom optimizer that might use gradients (or might not, e.g. one that just changes the weights directly with some rule, maybe a more biologically plausible one), is that not possible with the above example code?
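To make concrete what I mean by changing the update rule, here is a minimal pure-Python sketch (no TensorFlow, scalars standing in for tensors). As far as I can tell, for plain gradient descent the apply step boils down to w := w - learning_rate * grad for each (gradient, variable) pair; what I would like is to swap in an arbitrary rule. The names apply_updates, sgd_rule and sign_rule are made up for illustration and are not TensorFlow functions.

```python
# Pure-Python sketch; apply_updates, sgd_rule and sign_rule are hypothetical
# names for illustration, not part of TensorFlow.

def sgd_rule(w, g, learning_rate=0.1):
    """Plain gradient descent: w := w - learning_rate * g."""
    return w - learning_rate * g

def sign_rule(w, g, step=0.01):
    """A non-standard rule: move a fixed step against the sign of g."""
    if g > 0:
        return w - step
    if g < 0:
        return w + step
    return w

def apply_updates(grads_and_vars, rule):
    """Apply `rule` to every (gradient, value) pair and return the new values."""
    return [rule(w, g) for g, w in grads_and_vars]

grads_and_vars = [(0.5, 1.0), (-2.0, 3.0)]  # (gradient, current value) pairs
print(apply_updates(grads_and_vars, sgd_rule))
print(apply_updates(grads_and_vars, sign_rule))
```

The only thing that changes between the two calls is the rule function, which is the kind of freedom I am asking about.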
Specifically, I wanted to implement a version of gradient descent that is artificially restricted to a compact domain, i.e. the following equation:
w := (w - mu*grad + eps) mod B
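in TensorFlow. I realized that the following is true (note that the right-hand side needs its own outer mod B, since the sum of the reduced terms can fall outside [0, B)):
w := ((w mod B) - (mu*grad mod B) + (eps mod B)) mod B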
so I thought that I could just implement it by doing:
def Process_grads(g, mu_noise, stddev_noise, B):
    return (g + tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)) % B
and then just having:
processed_grads_and_vars = [(Process_grads(gv[0], mu_noise, stddev_noise, B), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the processed gradients.
opt.apply_gradients(processed_grads_and_vars)
However, I realized that this wasn't good enough because I don't actually have access to w, so I can't implement w mod B, at least not the way I tried. Is there a way to do this, i.e. to directly change the update rule?
I know it's sort of a hacky update rule, but my point is more about changing the update equation than about that particular rule (so don't get hung up on it if it's a bit weird).
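For reference, this is the rule written in plain Python on scalars, as a sketch of the behaviour I am after rather than TensorFlow code. compact_update and termwise_update are hypothetical names, and the noise parameters are assumed values. It relies on Python's % returning a non-negative result for positive B, and it also shows that the term-by-term form needs one final mod B to agree with the direct form.

```python
import random

def compact_update(w, grad, mu, B, eps=0.0):
    """One step of w := (w - mu*grad + eps) mod B.

    For B > 0 the result lies in [0, B), up to floating-point rounding.
    """
    return (w - mu * grad + eps) % B

def termwise_update(w, grad, mu, B, eps=0.0):
    """Reduce each term mod B first, then apply the final mod B that is still required."""
    return ((w % B) - ((mu * grad) % B) + (eps % B)) % B

# Integer example, so the comparison is exact: (3 - 1*5) mod 20.
print(compact_update(3, 5, 1, 20), termwise_update(3, 5, 1, 20))

# Float example with Gaussian noise (mean 0 and stddev 0.01 are assumed values).
rng = random.Random(0)
w = compact_update(19.5, -8.0, 0.1, 20.0, eps=rng.gauss(0.0, 0.01))
print(0.0 <= w < 20.0)
```

Without the final mod in termwise_update, the integer example would give 3 - 5 = -2, which is outside the domain.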
I came up with a super hacky solution:
def manual_update_GDL(arg, learning_rate, g, mu_noise, stddev_noise, B):
    # Fetch the existing weight variable W from the model's scope.
    with tf.variable_scope(arg.mdl_scope_name, reuse=True):
        W_var = tf.get_variable(name='W')
    eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
    # w := (w - mu*grad + eps) mod B
    W_new = tf.mod(W_var - learning_rate*g + eps, B)
    # Return the assign op; build it once and run it with sess.run at each step.
    return W_var.assign(W_new)

def manual_GDL(arg, loss, learning_rate, mu_noise, stddev_noise, compact, B):
    # Compute the gradients for a list of variables (opt is the optimizer created earlier).
    grads_and_vars = opt.compute_gradients(loss)
    # Build one manual update op per (gradient, variable) pair.
    update_ops = [manual_update_GDL(arg, learning_rate, g, mu_noise, stddev_noise, B) for g, v in grads_and_vars]
    return tf.group(*update_ops)
I'm not sure if it works, but something like that should work in general. The idea is to write down the update equation one wants to use (in TensorFlow) and then update the weights manually using a session.
Unfortunately, such a solution means we have to take care of annealing (decaying the learning rate) manually, which seems annoying. This solution probably has many other problems; feel free to point them out (and give solutions if you can).
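The manual annealing itself would not be much code. A minimal sketch, assuming an exponential schedule of the form initial_rate * decay_rate ** (step / decay_steps); the function name and the numbers are placeholders of mine, and the result would be fed in as learning_rate at each step.

```python
def decayed_learning_rate(initial_rate, step, decay_steps, decay_rate):
    """Exponential decay: initial_rate * decay_rate ** (step / decay_steps)."""
    return initial_rate * decay_rate ** (step / decay_steps)

# Assumed example values: start at 0.1 and halve the rate every 100 steps.
for step in (0, 100, 200):
    print(step, decayed_learning_rate(0.1, step, decay_steps=100, decay_rate=0.5))
```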
For this very simple problem I realized one can just do the normal optimizer update rule and then just take the mod of the weights and re-assign them to their value:
sess.run(fetches=train_step)
if arg.compact:
    # apply w := ( w - mu*g + eps ) mod B
    W_val = W_var.eval()
    W_new = tf.mod(W_var,arg.B).eval()
    W_var.assign(W_new).eval()
but in this case it's a coincidence that such a simple solution exists (and, unfortunately, it bypasses the whole point of my question).
Actually, this solution slows down the code a lot. For the moment it is the best that I've got.
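In plain Python, the two-phase version (ordinary step, then wrap the weights) looks like the sketch below. The quadratic loss, the target 25.0 and the function names are made up for illustration; the point is only that wrapping after the step gives the same iterate as applying w := (w - mu*g) mod B directly.

```python
def grad_loss(w, target=25.0):
    """Gradient of the toy loss (w - target)**2; the target is an assumed value."""
    return 2.0 * (w - target)

def train(w, mu, B, steps):
    """Ordinary gradient step followed by a separate wrap into [0, B)."""
    history = []
    for _ in range(steps):
        w = w - mu * grad_loss(w)  # phase 1: normal optimizer update
        w = w % B                  # phase 2: re-assign w := w mod B
        history.append(w)
    return history

history = train(w=1.0, mu=0.1, B=20.0, steps=50)
print(all(0.0 <= w < 20.0 for w in history))
```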
As a reference, I have seen this question: How to create an optimizer in Tensorflow, but it didn't directly answer my question.