I am currently reading Sutton's introduction to reinforcement learning. After arriving at chapter 10 (On-Policy Prediction with Approximation), I am now wondering how to choose the form of the function $\hat{q}$ whose optimal weights $\mathbf{w}$ are to be approximated.
I am referring to the first line of the pseudo code below from Sutton: how do I choose a good differentiable function $\hat{q}(s, a, \mathbf{w})$? Are there any standard strategies for choosing it?
You can choose any function approximator that is differentiable. Two commonly used classes of value function approximators are:
Linear function approximators: Linear combinations of features,
$$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i \, x_i(s, a),$$
where $\mathbf{x}(s, a)$ is a feature vector in $\mathbb{R}^d$ with $i$-th component given by $x_i(s, a)$, and $\mathbf{w}$ is the weight vector whose $i$-th component is given by $w_i$.
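As a minimal sketch of this (the particular feature function below is a hypothetical stand-in; in practice you would design features for your problem):

```python
import numpy as np

def features(state, action):
    """Hypothetical hand-crafted feature vector x(s, a) for a scalar state
    and a scalar action; real problems would use tile coding, RBFs, etc."""
    return np.array([1.0, state, action, state * action, state ** 2, action ** 2])

def q_hat(state, action, w):
    """Linear approximation: q_hat(s, a, w) = w^T x(s, a) = sum_i w_i * x_i(s, a)."""
    return w @ features(state, action)

def grad_q_hat(state, action, w):
    """For a linear approximator the gradient with respect to w is just x(s, a)."""
    return features(state, action)

w = np.zeros(6)            # weight vector, one component per feature
print(q_hat(0.5, 1.0, w))  # 0.0 before any learning
```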
Neural network function approximators:
Represent $\hat{q}(s, a, \mathbf{w})$ using a neural network. You can approximate it with either the action-in type (left of the figure below) or the action-out type (right of the figure below). The difference is that the network either takes as input representations of both the state and the action and produces a single value (the Q-value) as output, or takes as input only a representation of the state $s$ and provides as output one value for each action $a$ in the action space (this type is easier to realize if the action space is discrete and finite). Using the first type (action-in), as it is close to the example in the linear case, you could create a Q-value approximator with a neural network along the lines of the sketch below.
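Here is a minimal action-in sketch in PyTorch (the hidden-layer size and the one-hot action encoding are illustrative assumptions, not something prescribed by the book):

```python
import torch
import torch.nn as nn

class ActionInQNet(nn.Module):
    """Action-in approximator: takes a state representation and an action
    encoding, and outputs a single Q-value q_hat(s, a, w)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # Concatenate the state and action representations into one input vector.
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

q_net = ActionInQNet(state_dim=4, action_dim=2)
s = torch.randn(1, 4)           # example state representation
a = torch.tensor([[1.0, 0.0]])  # one-hot encoded action
print(q_net(s, a))              # a single Q-value
```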
You could also directly use the visuals (if available) as the input and use convolutional layers, as in the DQN paper. But read the note below regarding convergence and the additional tricks needed to stabilize such non-linear-approximator-based methods.
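A DQN-style action-out variant could be sketched like this (the input resolution and layer sizes are illustrative, not the ones from the paper):

```python
import torch
import torch.nn as nn

class ActionOutConvQNet(nn.Module):
    """Action-out approximator: takes a stack of image frames and outputs
    one Q-value per discrete action."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, frames):               # frames: (batch, 4, 84, 84)
        return self.head(self.conv(frames))  # (batch, n_actions)

q_net = ActionOutConvQNet(n_actions=6)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 6])
```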
Graphically the function approximator looks like this:
Note that each feature $x_i$ is an elementary function used to represent elements of the state-action vector. You can use any elementary function in place of $x_i$; some common ones are linear functions, radial basis functions (RBFs), etc.
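For example, radial basis function features could be sketched as follows (the centres and width below are hypothetical choices):

```python
import numpy as np

def rbf_features(state_action, centers, sigma=0.5):
    """Each feature is an elementary function (here a Gaussian RBF) of the
    state-action vector z: x_i(z) = exp(-||z - c_i||^2 / (2 * sigma^2))."""
    z = np.asarray(state_action, dtype=float)
    return np.exp(-np.sum((centers - z) ** 2, axis=1) / (2.0 * sigma ** 2))

centers = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.0]])  # hypothetical RBF centres
print(rbf_features([0.4, 0.8], centers))                  # feature vector in R^3
```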
Which differentiable function is a good choice depends on the context. But in reinforcement learning settings, convergence properties and error bounds are important. The episodic semi-gradient Sarsa algorithm discussed in the book has convergence properties similar to those of TD(0) for a constant policy.
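For reference, the one-step update performed by episodic semi-gradient Sarsa is
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \bigl[ R + \gamma\, \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \bigr] \nabla \hat{q}(S, A, \mathbf{w}),$$
so the only requirement it places on the approximator is that $\hat{q}$ be differentiable with respect to $\mathbf{w}$.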
Since you specifically asked about on-policy prediction, a linear function approximator is advisable because it is guaranteed to converge (under the usual step-size conditions). The following are some of the other properties that make linear function approximators suitable:
The error bound (as proved by Tsitsiklis and Van Roy, 1997, for the general case of TD($\lambda$); stated here for TD(0)) is
$$\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma}\, \min_{\mathbf{w}} \overline{VE}(\mathbf{w}),$$
which means that the asymptotic error will be no more than $\frac{1}{1-\gamma}$ times the smallest possible error, where $\gamma$ is the discount factor.
The gradient is simple to calculate: for a linear approximator it is just the feature vector, $\nabla_{\mathbf{w}} \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)$.
Using a non-linear approximator (like a (deep) neural network), however, does not inherently guarantee convergence. Gradient-TD methods use the true gradient of the projected Bellman error for their updates, instead of the semi-gradient used in the episodic semi-gradient Sarsa algorithm, and they are known to converge even with non-linear function approximators (and even for off-policy prediction) if certain conditions are met.
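For completeness, the objective that Gradient-TD methods minimize (in the linear case) is the mean squared projected Bellman error,
$$\overline{PBE}(\mathbf{w}) = \bigl\lVert \Pi\, \bar{\delta}_{\mathbf{w}} \bigr\rVert_{\mu}^{2},$$
where $\bar{\delta}_{\mathbf{w}}$ is the Bellman error vector and $\Pi$ is the projection onto the space of functions representable by the approximator, weighted by the state distribution $\mu$.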