EDIT (1/3/16): corresponding GitHub issue
I'm using TensorFlow (Python interface) to implement a Q-learning agent with function approximation, trained using stochastic gradient descent. At each iteration of the experiment, a step function in the agent is called. It updates the parameters of the approximator based on the new reward and activation, and then chooses a new action to perform.
Here is the problem (with reinforcement learning jargon):
- The agent computes its state-action value predictions to choose an action.
- It then gives control back to another program, which simulates a step in the environment.
- Now the agent's step function is called for the next iteration. I want to use TensorFlow's Optimizer class to compute the gradients for me. However, this requires both the state-action value predictions that I computed in the last step, AND their graph. So:
- If I run the optimizer on the whole graph, then it has to recompute the state-action value predictions.
- But, if I store the prediction (for the chosen action) as a variable, then feed it to the optimizer as a placeholder, it no longer has the graph necessary to compute the gradients.
- I can't just run it all in the same sess.run() statement, because I have to give up control and return the chosen action in order to get the next observation and reward (to use in the target for the loss function).
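To make the recomputation concrete, here is a minimal plain-Python sketch of the control flow described above. It is not my actual agent: `CountingAgent`, its scalar observation and its one-weight-per-action "network" are stand-ins invented for illustration. The only point is to count how often the forward pass runs.

```python
# Hypothetical stand-in for the agent: counts forward passes per step.
class CountingAgent:
    def __init__(self):
        self.forward_calls = 0
        self.weights = [0.1, -0.2, 0.3, 0.0]  # one weight per action (illustrative)
        self.last_obs = None
        self.last_act = None

    def q_values(self, obs):
        """Forward pass: one predicted value per action."""
        self.forward_calls += 1
        return [w * obs for w in self.weights]

    def _choose(self, obs, q):
        # Greedy choice; remember what was done for the next update.
        self.last_obs = obs
        self.last_act = q.index(max(q))
        return self.last_act

    def start(self, obs):
        return self._choose(obs, self.q_values(obs))

    def step(self, reward, obs):
        q_now = self.q_values(obs)             # needed for the target and for acting
        q_last = self.q_values(self.last_obs)  # recomputed only to get the gradient
        target = reward + 0.5 * max(q_now)
        td_error = target - q_last[self.last_act]
        # Gradient step on the weight of the previously chosen action.
        self.weights[self.last_act] += 0.01 * td_error * self.last_obs
        return self._choose(obs, q_now)


agent = CountingAgent()
agent.start(1.0)
for _ in range(5):
    agent.step(reward=1.0, obs=1.0)
print(agent.forward_calls)  # 1 call in start() plus 2 per step
```

The second `q_values` call in `step` is the one I would like to get rid of: its result was already computed during the previous iteration.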
So, is there a way that I can (without reinforcement learning jargon):
- Compute part of my graph, returning value1.
- Return value1 to the calling program to compute value2
- In the next iteration, use value2 as part of my loss function for gradient descent WITHOUT recomputing the part of the graph that computes value1.
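In plain Python, the behaviour I'm after would look roughly like the sketch below. The forward pass returns value1 together with a closure that already holds everything needed for the gradient. The names `forward` and `backward`, the linear model and the squared-error loss are all assumptions made for illustration; this is not a TensorFlow API.

```python
def forward(weights, features):
    """Compute value1 and return a closure that can later produce gradients."""
    value1 = sum(w * x for w, x in zip(weights, features))

    def backward(value2):
        # Loss = (value2 - value1)^2, so dLoss/dw_i = -2 * (value2 - value1) * x_i.
        # value1 and features are captured here, so nothing is recomputed.
        error = value2 - value1
        return [-2.0 * error * x for x in features]

    return value1, backward


weights = [0.5, -1.0]
value1, backward = forward(weights, [1.0, 2.0])  # iteration t: hand value1 to the caller
value2 = 0.5                                     # supplied later by the calling program
grads = backward(value2)                         # iteration t+1: no second forward pass
weights = [w - 0.1 * g for w, g in zip(weights, grads)]
```

What I can't work out is how to get the equivalent of `backward` out of a symbolic graph without running the forward part again.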
Of course, I've considered the obvious solutions:
Just hardcode the gradients: This would be easy for the really simple approximators I'm using now, but would be really inconvenient if I were experimenting with different filters and activation functions in a big convolutional network. I'd really like to use the Optimizer class if possible.
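For reference, this is roughly what the hardcoded version looks like for a linear approximator, where Q(s, a) is the dot product of a per-action weight vector with the features. It is a sketch under that assumption, not the network in the code below; `td_update` and its parameters are names made up for the example. The target is treated as a constant, which matches feeding it in through a placeholder.

```python
def td_update(weights, last_features, last_action, last_prediction,
              reward, next_q_max, discount=0.5, learning_rate=0.01):
    """One hand-derived SGD step on (y - Q(s, a))^2 for a linear Q-function.

    weights[a] is the weight vector for action a. last_prediction is the
    value stored from the previous step, so no forward pass is repeated.
    """
    target = reward + discount * next_q_max
    error = target - last_prediction
    # d/dw (y - w.x)^2 = -2 * (y - w.x) * x, so descending adds 2 * lr * error * x.
    weights[last_action] = [w + 2.0 * learning_rate * error * x
                            for w, x in zip(weights[last_action], last_features)]
    return error


weights = [[0.0, 0.0], [0.0, 0.0]]
error = td_update(weights, last_features=[1.0, 0.0], last_action=1,
                  last_prediction=0.0, reward=1.0, next_q_max=0.0)
```

This works because the gradient of a linear model only needs the stored prediction and the features. With hidden layers, the intermediate activations would have to be stored and the chain rule written out by hand for every architecture, which is exactly what I'd like to avoid.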
Call the environment simulation from within the agent: This system does this, but it would make mine more complicated, and remove a lot of the modularity and structure. So, I don't want to do this.
I've read through the API and whitepaper several times, but can't seem to come up with a solution. I was trying to come up with some way to feed the target into a graph to calculate the gradients, but couldn't come up with a way to build that graph automatically.
If it turns out this isn't possible in TensorFlow yet, do you think it would be very complicated to implement this as a new operator? (I haven't used C++ in a couple of years, so the TensorFlow source looks a little intimidating.) Or would I be better off switching to something like Torch, which has imperative differentiation (Autograd) instead of symbolic differentiation?
Thanks for taking the time to help me out on this. I was trying to make this as concise as I could.
EDIT: After doing some further searching I came across this previously asked question. It's a little different than mine (they are trying to avoid updating an LSTM network twice every iteration in Torch), and doesn't have any answers yet.
Here is some code if that helps:
'''
-Q-Learning agent for a grid-world environment.
-Receives input as raw rgb pixel representation of screen.
-Uses an artificial neural network function approximator with one hidden layer
2015 Jonathon Byrd
'''
import random
import sys
#import copy
from rlglue.agent.Agent import Agent
from rlglue.agent import AgentLoader as AgentLoader
from rlglue.types import Action
from rlglue.types import Observation
import tensorflow as tf
import numpy as np

world_size = (3,3)
total_spaces = world_size[0] * world_size[1]

class simple_agent(Agent):
    #Constants
    discount_factor = tf.constant(0.5, name="discount_factor")
    learning_rate = tf.constant(0.01, name="learning_rate")
    exploration_rate = tf.Variable(0.2, name="exploration_rate") # used to be a constant :P
    hidden_layer_size = 12

    #Network Parameters - weights and biases
    W = [tf.Variable(tf.truncated_normal([total_spaces * 3, hidden_layer_size], stddev=0.1), name="layer_1_weights"),
         tf.Variable(tf.truncated_normal([hidden_layer_size,4], stddev=0.1), name="layer_2_weights")]
    b = [tf.Variable(tf.zeros([hidden_layer_size]), name="layer_1_biases"), tf.Variable(tf.zeros([4]), name="layer_2_biases")]

    #Input placeholders - observation and reward
    screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="observation") #input pixel rgb values
    reward = tf.placeholder(tf.float32, shape=[], name="reward")

    #last step data
    last_obs = np.array([1, 2, 3], ndmin=4)
    last_act = -1

    #Last step placeholders
    last_screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="previous_observation")
    last_move = tf.placeholder(tf.int32, shape = [], name="previous_action")
    next_prediction = tf.placeholder(tf.float32, shape = [], name="next_prediction")

    step_count = 0

    def __init__(self):
        #Initialize computational graphs
        self.q_preds = self.Q(self.screen)
        self.last_q_preds = self.Q(self.last_screen)
        self.action = self.choose_action(self.q_preds)
        self.next_pred = self.max_q(self.q_preds)
        self.last_pred = self.act_to_pred(self.last_move, self.last_q_preds) # inefficient recomputation
        self.loss = self.error(self.last_pred, self.reward, self.next_prediction)
        self.train = self.learn(self.loss)
        #Summaries and Statistics
        tf.scalar_summary(['loss'], self.loss)
        tf.scalar_summary('reward', self.reward)
        #w_hist = tf.histogram_summary("weights", self.W[0])
        self.summary_op = tf.merge_all_summaries()
        self.sess = tf.Session()
        self.summary_writer = tf.train.SummaryWriter('tensorlogs', graph_def=self.sess.graph_def)

    def agent_init(self,taskSpec):
        print("agent_init called")
        self.sess.run(tf.initialize_all_variables())

    def agent_start(self,observation):
        #print("agent_start called, observation = {0}".format(observation.intArray))
        # divide by 255.0 (a float) so integer pixel values are not floor-divided under Python 2
        o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255.0)
        return self.control(o)

    def agent_step(self,reward, observation):
        #print("agent_step called, observation = {0}".format(observation.intArray))
        print("step, reward: {0}".format(reward))
        # divide by 255.0 (a float) so integer pixel values are not floor-divided under Python 2
        o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255.0)

        next_prediction = self.sess.run([self.next_pred], feed_dict={self.screen:o})[0]

        if self.step_count % 10 == 0:
            summary_str = self.sess.run([self.summary_op, self.train],
                feed_dict={self.reward:reward, self.last_screen:self.last_obs,
                self.last_move:self.last_act, self.next_prediction:next_prediction})[0]

            self.summary_writer.add_summary(summary_str, global_step=self.step_count)
        else:
            self.sess.run([self.train],
                feed_dict={self.screen:o, self.reward:reward, self.last_screen:self.last_obs,
                self.last_move:self.last_act, self.next_prediction:next_prediction})

        return self.control(o)

    def control(self, observation):
        results = self.sess.run([self.action], feed_dict={self.screen:observation})
        action = results[0]

        self.last_act = action
        self.last_obs = observation

        if (action==0): # convert action integer to direction character
            action = 'u'
        elif (action==1):
            action = 'l'
        elif (action==2):
            action = 'r'
        elif (action==3):
            action = 'd'
        returnAction=Action()
        returnAction.charArray=[action]
        #print("return action returned {0}".format(action))
        self.step_count += 1
        return returnAction

    def Q(self, obs): #calculates state-action value prediction with feed-forward neural net
        with tf.name_scope('network_inference') as scope:
            h1 = tf.nn.relu(tf.matmul(obs, self.W[0]) + self.b[0])
            q_preds = tf.matmul(h1, self.W[1]) + self.b[1] #linear activation
            return tf.reshape(q_preds, shape=[4])

    def choose_action(self, q_preds): #chooses action epsilon-greedily
        with tf.name_scope('action_choice') as scope:
            exploration_roll = tf.random_uniform([])
            #greedy_action = tf.argmax(q_preds, 0) # gets the action with the highest predicted Q-value
            #random_action = tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)

            #exploration rate updates
            #if self.step_count % 10000 == 0:
                #self.exploration_rate.assign(tf.div(self.exploration_rate, 2))

            return tf.select(tf.greater_equal(exploration_roll, self.exploration_rate),
                tf.argmax(q_preds, 0), #greedy_action
                tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)) #random_action

        '''
        Why does this return NoneType?:

        flag = tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 'g', 'r')
        if flag == 'g': #greedy
            return tf.argmax(q_preds, 0) # gets the action with the highest predicted Q-value
        elif flag == 'r': #random
            return tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)
        '''

    def error(self, last_pred, r, next_pred):
        with tf.name_scope('loss_function') as scope:
            y = tf.add(r, tf.mul(self.discount_factor, next_pred)) #target
            return tf.square(tf.sub(y, last_pred)) #squared difference error

    def learn(self, loss): #Update parameters using stochastic gradient descent
        #TODO: Either figure out how to avoid computing the q-prediction twice or just hardcode the gradients.
        with tf.name_scope('train') as scope:
            return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(loss, var_list=[self.W[0], self.W[1], self.b[0], self.b[1]])

    def max_q(self, q_preds):
        with tf.name_scope('greedy_estimate') as scope:
            return tf.reduce_max(q_preds) #best predicted action from current state

    def act_to_pred(self, a, preds): #get the value prediction for action a
        with tf.name_scope('get_prediction') as scope:
            return tf.slice(preds, tf.reshape(a, shape=[1]), [1])

    def agent_end(self,reward):
        pass

    def agent_cleanup(self):
        self.sess.close()

    def agent_message(self,inMessage):
        if inMessage=="what is your name?":
            return "my name is simple_agent"
        else:
            return "I don't know how to respond to your message"

if __name__=="__main__":
    AgentLoader.loadAgent(simple_agent())