The question is about vanilla, non-batched reinforcement learning. Basically what is defined here in Sutton's book.
My model trains, (woohoo!) though there is an element that confuses me.
Background:
In an environment where duration is rewarded (like pole-balancing), we have rewards of (say) 1 per step. After an episode, before sending this array of 1's to the train step, we do the standard discounting and normalization to get returns:
returns = self.discount_rewards(rewards)
returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-10) // usual normalization
The discount_rewards is the usual method, but here is gist if curious.
So an array of rewards [1,1,1,1,1,1,1,1,1] becomes an array of returns [1.539, 1.160, 0.777, 0.392, 0.006, -0.382, -0.773, -1.164, -1.556].
Given that basic background I can ask my question:
If positive returns are enforced, and negative returns are discouraged (in the optimize step), then no matter the length of the episode, roughly the first half of the actions will be encouraged, and the latter half will be discouraged. Is that true, or am I misunderstanding something?
If its NOT true, would love to understand what I got wrong.
If it IS true, then I don't understand why the model trains, since even a good-performing episode will have the latter half of its actions discouraged.
To reiterate, this is non-batched learning (so the returns are not relative to returns in another episode in the training step). After each episode, the model trains, and again, it trains well :)
Hoping this makes sense, and is short enough to feel like a proper clear question.
Background
- Yes, positive rewards are better than negative rewards
- No, positive rewards are not good on an absolute scale
- No, negative rewards are not bad on an absolute scale
If you increase or decrease all rewards (good and bad) equally, nothing changes really.
The optimizer tries to minimize the loss (maximize the reward), that means it's interested only in the delta between values (the gradient), not their absolute value or their sign.
Reinforcement Learning
Let's say your graph looks something like this:
...
logits = tf.nn.softmax(...)
labels = tf.one_hot(q_actions, n_actions)
loss = tf.losses.softmax_cross_entropy(labels, logits, weights=q_rewards)
The losses for the individual "classes" get scaled by weights
which in this case are q_rewards
:
loss[i] = -q_rewards[i] * tf.log( tf.nn.softmax( logits[i] ) )
The loss is a linear function of the reward, the gradient stays monotonic under linear transformation.
Reward Normalization
- doesn't mess with the sign of the gradient
- makes the gradient steeper for rewards far from the mean
- makes the gradient shallower for rewards near the mean
When the agent performs rather badly, it receives much more bad rewards than good rewards. Normalization makes the gradient steeper for (puts more weight on) the good rewards and shallower for (puts less weight on) the bad rewards.
When the agent performs rather good, it's the other way 'round.
Your questions
If positive returns are enforced, and negative returns are discouraged (in the optimize step) ...
It's not the sign (absolute value) but the delta (relative values).
... then no matter the length of the episode, roughly the first half of the actions will be encouraged, and the latter half will be discouraged.
If there are either much more high or much more low reward values, then you have a smaller half with a steeper gradient (more weight) and a larger half with a shallower gradient (less weight).
If it IS true, then I don't understand why the model trains, since even a good-performing episode will have the latter half of its actions discouraged.
Your loss value is actually expected to stay about constant at some point. So you have to measure your progress by running the program and looking at the (un-normalized) rewards.
For reference, see the example network from Google IO:
github.com/GoogleCloudPlatform/tensorflow-without-a-phd/.../tensorflow-rl-pong/... and search for _rollout_reward
This isn't a bad thing, however. It's just that your loss is (more or less) "normalized" as well. But the network keeps improving anyway by looking at the gradient at each training step.
Classification problems usually have a "global" loss which keeps falling over time. Some optimizers keep a history of the gradient to adapt the learning rate (effectively scaling the gradient) which means that internally, they also kinda "normalize" the gradient and thus don't care if we do either.
If you want to learn more about behind-the-scenes gradient scaling, I suggest taking a look at ruder.io/optimizing-gradient-descent
To reiterate, this is non-batched learning (so the returns are not relative to returns in another episode in the training step). After each episode, the model trains, and again, it trains well :)
The larger your batch size, the more stable your distribution of rewards, the more reliable the normalization. You could even normalize rewards across multiple episodes.