Q-Learning values get too high

Posted 2019-02-15 00:55

I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.

Here's how I implemented the solution for an m,n,k-game environment: at each time step t, the agent holds the last state-action pair (s, a) and the reward acquired for it; the agent selects a move a' following an epsilon-greedy policy, calculates the reward r, and then updates the value of Q(s, a) for time t-1:

func (agent *RLAgent) learn(reward float64) {
    var mState = marshallState(agent.prevState, agent.id)
    var oldVal = agent.values[mState]

    // TD update: Q(s,a) <- Q(s,a) + α * (prevScore + γ * reward - Q(s,a))
    agent.values[mState] = oldVal + (agent.LearningRate *
        (agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}

Note:

  • agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player moves). I use that in place of the state-action tuple, but I'm not quite sure that's the right approach (see the sketch after this list)
  • agent.prevScore holds the reward for the previous state-action
  • The reward argument represents the reward for the current step's state-action (Qmax)
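
For comparison, here is a minimal sketch of keying the Q-table by an explicit (state, action) pair instead of the post-move state; Action and marshalStateAction are hypothetical names for illustration, not from my actual code:

package main

import "fmt"

// Action is a hypothetical move encoding (e.g. a board-cell index).
type Action int

// marshalStateAction is a hypothetical helper that encodes a board state
// together with the chosen action into a single Q-table key.
func marshalStateAction(state string, a Action) string {
    return fmt.Sprintf("%s|%d", state, a)
}

func main() {
    // Q-table keyed by explicit (state, action) pairs.
    values := make(map[string]float64)
    values[marshalStateAction("XO.......", Action(4))] = 0.5
    fmt.Println(values[marshalStateAction("XO.......", Action(4))]) // 0.5
}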

With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8, the agent fails to reach 100K episodes because of state-action value overflow. I'm using Go's float64 (a standard IEEE 754-1985 double-precision floating point variable), which overflows at around ±1.80×10^308 and yields ±Inf. That's too big a value, I'd say!
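
As a side note, a quick way to catch that kind of divergence early is to check each updated value for ±Inf or NaN; this is just a minimal illustrative sketch, not part of my agent:

package main

import (
    "fmt"
    "math"
)

// checkedAdd applies a delta to a Q-value and reports if it diverged to ±Inf or NaN.
func checkedAdd(values map[string]float64, key string, delta float64) error {
    values[key] += delta
    if math.IsInf(values[key], 0) || math.IsNaN(values[key]) {
        return fmt.Errorf("q-value for %q diverged to %v", key, values[key])
    }
    return nil
}

func main() {
    values := map[string]float64{"some-state": math.MaxFloat64}
    if err := checkedAdd(values, "some-state", math.MaxFloat64); err != nil {
        fmt.Println(err) // q-value for "some-state" diverged to +Inf
    }
}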

Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08, which got through 2M episodes (1M games against itself):

Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000

The reward function returns:

  • Agent won: 1
  • Agent lost: -1
  • Draw: 0
  • Game continues: 0.5

But you can see that the minimum value is zero, and the maximum value is too high.

It may be worth mentioning that a simpler learning method I found in a Python script works perfectly fine and actually feels more intelligent! When I play against it, the result is a draw most of the time (it even wins if I play carelessly), whereas with the standard Q-learning method I can't even let it win!

agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))

Any ideas on how to fix this? Is that kind of state-action value normal in Q-Learning?!


Update: After reading Pablo's answer and the slight but important edit Nick made to this question, I realized the problem was that prevScore contained the Q-value of the previous step (equal to oldVal) instead of the reward of the previous step (in this example, -1, 0, 0.5 or 1).
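
For reference, here is roughly what the corrected learn looks like (same RLAgent type and marshallState helper as above); the body is unchanged, the only difference is that prevScore is now set to the previous step's reward elsewhere, and I've renamed the argument to maxNextQ here just to make its role explicit:

func (agent *RLAgent) learn(maxNextQ float64) {
    var mState = marshallState(agent.prevState, agent.id)
    var oldVal = agent.values[mState]

    // Q(s,a) <- Q(s,a) + α * (r + γ * max_a' Q(s',a') - Q(s,a)),
    // where agent.prevScore now holds the reward r (-1, 0, 0.5 or 1) for (s, a)
    // and maxNextQ is the maximum Q-value reachable from the next state.
    agent.values[mState] = oldVal + (agent.LearningRate *
        (agent.prevScore + (agent.DiscountFactor * maxNextQ) - oldVal))
}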

After that change, the agent now behaves normally and after 2M episodes, the state of the model is as follows:

Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718

Out of 5 games against the agent, I won 2 (the agent did not recognize that I had two stones in a row) and 3 ended in a draw.

2 Answers
我只想做你的唯一
#2 · 2019-02-15 01:22

If I've understood correctly, in your Q-learning update rule you are using both the current reward and the previous reward. However, the Q-learning rule uses only one reward (here x are states and u are actions):

Q(x, u) ← Q(x, u) + α [ r + γ max_{u'} Q(x', u') − Q(x, u) ]

On the other hand, you are assuming that the current reward is the same as the Qmax value, which is not true. So you are probably misunderstanding the Q-learning algorithm.
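
For comparison, here is a minimal Go sketch of that update with the single reward r and the max over next-state actions kept separate; the map keys and function name are made up for illustration:

package main

import "fmt"

// qLearningUpdate applies one tabular Q-learning step to the pair keyed by sa.
// nextKeys lists the Q-table keys of all actions available from the next state.
func qLearningUpdate(q map[string]float64, sa string, r, alpha, gamma float64, nextKeys []string) {
    maxNext := 0.0
    for i, k := range nextKeys {
        if i == 0 || q[k] > maxNext {
            maxNext = q[k]
        }
    }
    q[sa] += alpha * (r + gamma*maxNext - q[sa])
}

func main() {
    q := map[string]float64{"s0|a0": 0.0, "s1|a0": 0.4, "s1|a1": 0.9}
    qLearningUpdate(q, "s0|a0", 1.0, 0.2, 0.8, []string{"s1|a0", "s1|a1"})
    fmt.Printf("%.3f\n", q["s0|a0"]) // 0.344
}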

我欲成王,谁敢阻挡
#3 · 2019-02-15 01:43

The reward function is likely the problem. Reinforcement learning methods try to maximize the expected total reward; here the agent gets a positive reward (0.5) for every time step the game continues, so the optimal policy is to play for as long as possible! The Q-values, which define the value function (the expected total reward of taking an action in a state and then behaving optimally), are growing because the correct expectation is unbounded. To incentivize winning, you should have a negative reward at every time step (kind of like telling the agent to hurry up and win).
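
Following that suggestion, the reward function could look something like the sketch below; the exact per-step penalty is a tuning choice, not a prescribed value:

// reward is a sketch of a reward signal with a small per-step penalty,
// so the agent is nudged to finish (and win) as quickly as possible.
func reward(won, lost, draw bool) float64 {
    switch {
    case won:
        return 1.0
    case lost:
        return -1.0
    case draw:
        return 0.0
    default:
        return -0.05 // small penalty while the game continues (tunable)
    }
}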

See 3.2 Goals and Rewards in Reinforcement Learning: An Introduction for more insight into the purpose and definition of reward signals. The problem you are facing is actually exercise 3.5 in the book.
