I have made a simple Tron game in C++ and an MLP with one hidden layer. I have implemented Q-learning with this neural network, but it is not causing the agent to win more games over time (even after 1 million games). I will try to explain in text what I did; hopefully someone can spot a mistake that might be causing this problem.
At every state there are four possible moves (north, east, south, west), and rewards are only given at the end of the game (-1 for a loss, 0 for a draw, 1 for a win).
I initialise 4 MLPs, one for each possible action, with 100 input nodes (the entire 10x10 game grid), where each point is 1 if the player itself has visited it, 0 if it is empty, and -1 if the opponent has visited it. Then there are 50 hidden nodes and 1 output node (I have also tried a single network with 4 output nodes, but that does not help either). The weights are initialised randomly between -0.5 and 0.5.
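For concreteness, the state encoding amounts to something like this (simplified sketch; the `Cell` enum and `encodeState` are just illustrative names, not my actual code):

```cpp
#include <array>

// Sketch of the encoding described above: the 10x10 grid is flattened
// into 100 inputs, 1 = own trail, -1 = opponent trail, 0 = empty.
enum class Cell { Empty, Self, Opponent };

constexpr int GRID = 10;

std::array<double, GRID * GRID> encodeState(
    const std::array<std::array<Cell, GRID>, GRID>& grid) {
    std::array<double, GRID * GRID> input{};  // all zeros (empty cells)
    for (int r = 0; r < GRID; ++r) {
        for (int c = 0; c < GRID; ++c) {
            if (grid[r][c] == Cell::Self)          input[r * GRID + c] =  1.0;
            else if (grid[r][c] == Cell::Opponent) input[r * GRID + c] = -1.0;
        }
    }
    return input;
}
```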
At every epoch I initialise the game environment with the 2 agents placed randomly in the grid. Then I run the game in a while loop until the game is over, and then reset the game environment. Within this while loop I do the following.
- I supply the MLPs with the current state, determine the action with the highest Q-value, and take it with a 90% chance (10% random movement). The Q-value is computed with a sigmoid or ReLU activation function (I have tried both).
- Then I calculate the 4 Q-values in the new state and use them to train the network of the move I just took, with the following target: Target = reward + gamma * maxQ(nextState). The error is then: error = Target - Q-value calculated at the previous state.
- I use backpropagation with the derivative of the sigmoid function and a high learning rate and momentum term to propagate the error backwards (a simplified sketch of one iteration of this loop follows the list).
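To make the loop concrete, here is a simplified sketch of one iteration; the `Mlp` struct with `predict`/`train` is only a placeholder for my actual networks, and the gamma value shown is illustrative:

```cpp
#include <algorithm>
#include <array>
#include <random>
#include <vector>

// Stand-in for one of the four per-action networks; predict/train are stubs.
struct Mlp {
    double predict(const std::vector<double>& /*state*/) const { return 0.0; }
    void   train(const std::vector<double>& /*state*/, double /*target*/) {}
};

constexpr double EPSILON = 0.1;   // 10% random moves, as described above
constexpr double GAMMA   = 0.9;   // discount factor (illustrative value)

// Epsilon-greedy action selection over the four per-action networks.
int selectAction(const std::array<Mlp, 4>& nets,
                 const std::vector<double>& state, std::mt19937& rng) {
    if (std::uniform_real_distribution<double>(0.0, 1.0)(rng) < EPSILON)
        return std::uniform_int_distribution<int>(0, 3)(rng);   // explore
    int best = 0;                                                // arg max Q
    for (int a = 1; a < 4; ++a)
        if (nets[a].predict(state) > nets[best].predict(state)) best = a;
    return best;
}

// One Q-learning update: target = reward + gamma * max_a Q(nextState, a);
// the network of the taken action is trained towards that target,
// so its error is target - Q(prevState, action).
void qUpdate(std::array<Mlp, 4>& nets, int action,
             const std::vector<double>& prevState,
             const std::vector<double>& nextState,
             double reward, bool gameOver) {
    double maxQNext = 0.0;                        // no bootstrap at game end
    if (!gameOver) {
        maxQNext = nets[0].predict(nextState);
        for (int a = 1; a < 4; ++a)
            maxQNext = std::max(maxQNext, nets[a].predict(nextState));
    }
    nets[action].train(prevState, reward + GAMMA * maxQNext);
}
```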
It seems as if my Q-values are either very low (on the order of 0.0001) or very close to 1 (0.999). And if I look at the error term every 10,000 games, it does not seem to be decreasing.
I started with an MLP that could learn the XOR function and am now using it for Q-learning. Maybe some of the underlying assumptions in the XOR case are different and cause the problem for Q-learning?
Or maybe it is the sparse input (just 100 values of 0, 1 or -1) that makes it impossible to learn?
Suggestions are really appreciated!
There are several factors that make it difficult to combine an MLP with Q-learning, especially for a newcomer to the field. Rich Sutton (one of the pioneers of Reinforcement Learning) has a question in the FAQ of his website related to yours, so I recommend you read that document.
It's well known that Q-learning with a feedforward neural network as the Q-function approximator can fail even in simple problems [Boyan & Moore, 1995].
A possible explanation is the phenomenon known as interference, described in [Barreto & Anderson, 2008]:
Interference happens when the update of one state–action pair changes the Q-values of other pairs, possibly in the wrong direction.
Interference is naturally associated with generalization, and also happens in conventional supervised learning. Nevertheless, in the reinforcement learning paradigm its effects tend to be much more harmful. The reason for this is twofold. First, the combination of interference and bootstrapping can easily become unstable, since the updates are no longer strictly local. The convergence proofs for the algorithms derived from (4) and (5) are based on the fact that these operators are contraction mappings, that is, their successive application results in a sequence converging to a fixed point which is the solution for the Bellman equation [14,36]. When using approximators, however, this asymptotic convergence is lost, [...]
Another source of instability is a consequence of the fact that in on-line reinforcement learning the distribution of the incoming data depends on the current policy. Depending on the dynamics of the system, the agent can remain for some time in a region of the state space which is not representative of the entire domain. In this situation, the learning algorithm may allocate excessive resources of the function approximator to represent that region, possibly “forgetting” the previous stored information.
In conclusion, starting with an MLP to approximate the Q-function is not a good idea.
References
Boyan, J. A. & Moore, A. W. (1995) Generalization in reinforcement learning: Safely approximating the value function. NIPS-7. San Mateo, CA: Morgan Kaufmann.
Barreto, A. M. S. & Anderson, C. W. (2008) Restricted gradient-descent algorithm for value-function approximation in reinforcement learning. Artificial Intelligence 172, 454–482.
It turned out that the learning rate was too high (0.05). When I lowered it to 0.005/0.001, this solved the problem.
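For anyone hitting the same issue: the learning rate only scales each weight update, so the change is a one-line tweak, roughly like this (illustrative sketch; the names and the momentum value are not from the actual code, only the learning rates come from my experiments):

```cpp
// Where the learning rate enters a plain gradient step with momentum.
constexpr double LEARNING_RATE = 0.005;  // 0.05 was too high; 0.005/0.001 worked
constexpr double MOMENTUM      = 0.9;    // illustrative value

// Weight change: dw = lr * gradient + momentum * previous dw
inline double weightDelta(double gradient, double previousDelta) {
    return LEARNING_RATE * gradient + MOMENTUM * previousDelta;
}
```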