Can one only implement gradient-descent-like optimizers with the example code for processing gradients in TensorFlow?

Posted 2019-06-17 06:47

Question:

I was looking at the example code for processing gradients that TensorFlow has:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

However, I noticed that the apply_gradients function comes from the GradientDescentOptimizer. Does that mean that with the example code above one can only implement gradient-descent-like update rules (note that we could swap opt for Adam or any of the other optimizers)? In particular, what does apply_gradients actually do? I did check the code on the TensorFlow GitHub page, but it was a bunch of Python far removed from any mathematical expressions, so it was hard to tell what it was doing and how it changed from optimizer to optimizer.

For example, if I wanted to implement my own custom optimizer that might use gradients (or might not, e.g. one that changes the weights directly with some rule, maybe a more biologically plausible one), is that not possible with the above example code?


In particular, I wanted to implement a version of gradient descent that is artificially restricted to a compact domain, i.e. the update equation:

w := (w - mu*grad + eps) mod B

in TensorFlow. I realized that mod distributes over each term of the sum, provided an outer mod is applied at the end:

w := (w mod B - (mu*grad) mod B + eps mod B) mod B
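
A quick numeric sanity check of that identity (the values here are arbitrary; Python's % is a floor mod, matching tf.floormod):

B = 20.0
w, mu, grad, eps = 37.5, 0.1, 4.2, 0.01

lhs = (w - mu*grad + eps) % B
rhs = ((w % B) - (mu*grad) % B + (eps % B)) % B  # note the outer mod
assert abs(lhs - rhs) < 1e-9  # both are 17.09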

so I thought that I could just implement it by doing:

def Process_grads(g,mu_noise,stddev_noise,B):
    return (g+tf.random_normal(tf.shape(g),mean=mu_noise,stddev=stddev_noise) ) % B

and then just having:

processed_grads_and_vars = [(Process_grads(gv[0], mu_noise, stddev_noise, B), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the processed gradients.
opt.apply_gradients(processed_grads_and_vars)

However, I realized that wasn't good enough, because I don't actually have access to w, so I can't compute:

w mod B

at least not the way I tried. Is there a way to do this, i.e. to directly change the update rule?

I know it's sort of a hacky update rule, but my point is more about how to change the update equation than about that particular rule (so don't get hung up on it if it's a bit weird).


I came up with a super hacky solution:

def manual_update_GDL(arg, sess, learning_rate, g, mu_noise, stddev_noise):
    with tf.variable_scope(arg.mdl_scope_name, reuse=True):
        W_var = tf.get_variable(name='W')
        eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
        # apply w := (w - mu*g + eps) mod B, with B hard-coded to 20
        W_new = tf.mod(W_var - learning_rate*g + eps, 20)
        sess.run(W_var.assign(W_new))

def manual_GDL(arg, sess, opt, loss, learning_rate, mu_noise, stddev_noise, compact, B):
    # Compute the gradients for a list of variables.
    grads_and_vars = opt.compute_gradients(loss)
    # process gradients: run the manual update for each (gradient, variable) pair
    processed_grads_and_vars = [(manual_update_GDL(arg, sess, learning_rate, g, mu_noise, stddev_noise), v) for g, v in grads_and_vars]

I'm not sure if it works, but something like that should work in general. The idea is to just write down the update equation one wants (in TensorFlow) and then apply it to the weights manually through a session.

Unfortunately, such a solution means we have to take care of annealing ourselves (decaying the learning rate manually, which seems annoying; see the sketch below). This solution probably has many other problems, so feel free to point them out (and give solutions if you can).
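
For the annealing at least, the decay can be kept inside the graph. A minimal sketch (the decay factor here is made up):

learning_rate = tf.Variable(0.1, trainable=False)
decay = 0.99  # made-up decay factor

# multiplicative decay; run this op once per training step
decay_lr = learning_rate.assign(learning_rate * decay)

Running decay_lr alongside the manual weight update keeps the schedule out of the Python loop.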


For this very simple problem I realized one can just run the normal optimizer update and then take the mod of the weights and re-assign them:

sess.run(fetches=train_step)
if arg.compact:
    # together with train_step this applies w := (w - mu*g + eps) mod B
    # (note: tf.mod and assign build new graph ops on every iteration)
    W_new = tf.mod(W_var, arg.B).eval()
    W_var.assign(W_new).eval()

but in this case it's a coincidence that such a simple solution exists (and, unfortunately, it bypasses the whole point of my question).

Actually, this solution slows down the code a lot. For the moment it's the best I've got.


For reference, I have seen this question: How to create an optimizer in Tensorflow, but I didn't find that it answered my question directly.

Answer 1:

Indeed you are somewhat restricted and cannot do just anything. However, what you wish to do can easily be done by writing your own subclass of the TensorFlow Optimizer class.

All you need to do is write an _apply_dense method for your class. The _apply_dense method takes the gradient and the variable as arguments, so you can do anything you want with those two.

Look here for an example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py That is the implementation of Adam in TensorFlow; all you need to do is change _apply_dense (at line 131) as well as the _prepare and _finish methods.

So for example:

def _apply_dense(self, grad, var):
    # self.B, self.eps and self.mu are tensors set up in __init__/_prepare
    B = math_ops.cast(self.B, var.dtype.base_dtype)
    eps = math_ops.cast(self.eps, var.dtype.base_dtype)
    mu = math_ops.cast(self.mu, var.dtype.base_dtype)

    # w := (w - mu*grad + eps) mod B
    var_update = state_ops.assign(var, tf.floormod(var - mu*grad + eps, B),
                                  use_locking=self._use_locking)
    return var_update
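
For completeness, here is a minimal sketch of what the full subclass could look like under the TF 1.x Optimizer API; the class name CompactSGD and the hyperparameter names (mu, B, eps_stddev) are made up for illustration:

import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import math_ops, state_ops

class CompactSGD(tf.train.Optimizer):
    def __init__(self, mu=0.1, B=20.0, eps_stddev=0.01,
                 use_locking=False, name="CompactSGD"):
        super(CompactSGD, self).__init__(use_locking, name)
        self._mu = mu
        self._B = B
        self._eps_stddev = eps_stddev

    def _prepare(self):
        # convert Python scalars to tensors once per apply_gradients call
        self._mu_t = ops.convert_to_tensor(self._mu, name="mu")
        self._B_t = ops.convert_to_tensor(self._B, name="B")
        self._eps_t = ops.convert_to_tensor(self._eps_stddev, name="eps_stddev")

    def _apply_dense(self, grad, var):
        mu = math_ops.cast(self._mu_t, var.dtype.base_dtype)
        B = math_ops.cast(self._B_t, var.dtype.base_dtype)
        stddev = math_ops.cast(self._eps_t, var.dtype.base_dtype)
        eps = tf.random_normal(tf.shape(var), stddev=stddev)
        # w := (w - mu*grad + eps) mod B
        return state_ops.assign(var, tf.floormod(var - mu*grad + eps, B),
                                use_locking=self._use_locking)

With that, opt = CompactSGD(mu=0.1, B=20.0) followed by opt.minimize(loss) should route the update through _apply_dense as usual.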


Answer 2:

Your solution slows down the code because you call sess.run and .eval() during the creation of your "train_step", which adds new ops to the graph on every iteration. Instead you should build the train_step graph once, using only internal TensorFlow functions (without sess.run and .eval()), and thereafter only evaluate train_step in a loop.

If you don't want to use any standard optimizer, you can write your own "apply gradient" graph. Here is one possible solution:

learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01

#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(some_loss, train_w_vars_list)

assign_list = []
for g, v in zip(grad, train_w_vars_list):
  eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
  assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))

#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))

train_step = tf.group(*assign_list)

You can also use one of the standard optimizers to create the grads_and_vars list (and use it in place of zip(grad, train_w_vars_list)), for example:
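
A minimal sketch of that variation (assuming the same some_loss, train_w_vars_list and noise parameters as above):

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# the optimizer's own learning rate is unused here; we only call compute_gradients
grads_and_vars = opt.compute_gradients(some_loss, var_list=train_w_vars_list)

assign_list = []
for g, v in grads_and_vars:
  eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
  assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))

train_step = tf.group(*assign_list)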

Here is a simple example for MNIST using your update rule:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow.examples.tutorials.mnist import input_data

import tensorflow as tf

# Import data
mnist = input_data.read_data_sets('PATH TO MNIST_data', one_hot=True)

# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W)


# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01

#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(cross_entropy, train_w_vars_list)

assign_list = []
for g, v in zip(grad, train_w_vars_list):
  eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
  assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))

#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))

train_step = tf.group(*assign_list)


sess = tf.InteractiveSession()
tf.global_variables_initializer().run()


# Train
for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})


# Test trained model
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                    y_: mnist.test.labels}))