Generalizing Q-learning to work with a continuous action space

Posted 2019-01-31 13:49

Question:

I'm trying to get an agent to learn the mouse movements necessary to best perform some task in a reinforcement learning setting (i.e. the reward signal is the only feedback for learning).

I'm hoping to use the Q-learning technique, but while I've found a way to extend this method to continuous state spaces, I can't seem to figure out how to accommodate a problem with a continuous action space.

I could just force all mouse movement to be of a certain magnitude and in only a certain number of different directions, but any reasonable way of making the actions discrete would yield a huge action space. Since standard Q-learning requires the agent to evaluate all possible actions, such an approximation doesn't solve the problem in any practical sense.

Answer 1:

The common way of dealing with this problem is to use actor-critic methods, which extend naturally to continuous action spaces. Basic Q-learning can diverge when combined with function approximation. If you still want to use it, you can try combining it with a self-organizing map, as done in "Applications of the self-organising map to reinforcement learning". The paper also contains further references you might find useful.
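To make the actor-critic idea concrete, here is a minimal sketch on a toy problem that is not from the question: a single state, one-step episodes, and a one-dimensional continuous action with an assumed reward of -(a - 3)^2. The actor is a Gaussian policy with a learned mean, and the critic is a learned estimate of the state's value. All names and hyperparameters are illustrative assumptions.

```python
import random


def train_actor_critic(target=3.0, sigma=0.5, actor_lr=0.01, critic_lr=0.1,
                       steps=5000, seed=0):
    """Toy actor-critic: single state, one-step episodes, 1-D continuous action.

    The reward -(a - target)^2 is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    mean = 0.0      # actor parameter: mean of the Gaussian policy
    baseline = 0.0  # critic parameter: estimated value of the single state
    for _ in range(steps):
        action = rng.gauss(mean, sigma)       # sample a continuous action
        reward = -(action - target) ** 2
        td_error = reward - baseline          # one-step TD error (episode ends)
        baseline += critic_lr * td_error      # critic update
        # Actor update: TD error times d/d(mean) of log N(action; mean, sigma)
        mean += actor_lr * td_error * (action - mean) / sigma ** 2
    return mean, baseline


learned_mean, learned_value = train_actor_critic()
print(f"learned policy mean: {learned_mean:.2f}")
```

The point of the sketch is that nothing in the update enumerates actions: the policy is sampled, and the critic only has to score the action that was actually taken.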



Answer 2:

Fast forward to this year: folks from DeepMind have proposed a deep reinforcement learning actor-critic method that handles both continuous state and continuous action spaces. It is based on a technique called the deterministic policy gradient. See the paper "Continuous control with deep reinforcement learning" and some implementations.
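As a rough illustration of the deterministic policy gradient step only, the sketch below updates a linear actor mu(s) = theta * s by chaining the critic's gradient with respect to the action through the actor. To keep it short, the critic is a hand-written function Q(s, a) = -(a - 2s)^2, which is an assumption for this toy. In the method from the paper the critic is a learned neural network, and the replay buffer and target networks are omitted here.

```python
import random


def critic_grad_wrt_action(state, action):
    """dQ/da for the hand-written toy critic Q(s, a) = -(a - 2s)^2."""
    return -2.0 * (action - 2.0 * state)


def train_deterministic_actor(lr=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # actor parameter: mu(s) = theta * s
    for _ in range(steps):
        state = rng.uniform(-1.0, 1.0)
        action = theta * state                # deterministic action, no sampling
        dq_da = critic_grad_wrt_action(state, action)
        dmu_dtheta = state                    # gradient of the actor output
        theta += lr * dq_da * dmu_dtheta      # chain rule: dQ/da * dmu/dtheta
    return theta


theta = train_deterministic_actor()
print(f"learned theta: {theta:.4f}")  # the toy critic is maximized at theta = 2
```

Because the actor outputs the action directly, there is no maximization over actions at any point, which is what makes the approach usable in continuous action spaces.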



Answer 3:

There are numerous ways to extend reinforcement learning to continuous actions. One way is to use actor-critic methods. Another way is to use policy gradient methods.

A rather extensive explanation of different methods can be found in the following paper, which is available online: Reinforcement Learning in Continuous State and Action Spaces



Answer 4:

For what you're doing I don't believe you need to work in continuous action spaces. Although the physical mouse moves in a continuous space, internally the cursor only moves in discrete steps (usually at the pixel level), so any precision finer than that is unlikely to affect your agent's performance. The state space is still quite large, but it is finite and discrete.
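To put rough numbers on this, the sketch below enumerates pixel-level moves and picks the greedy one from a tabular Q row. The maximum step of 10 pixels and the 1920x1080 screen are hypothetical values chosen for illustration, and `pixel_moves` and `greedy_move` are made-up helper names.

```python
def pixel_moves(max_step):
    """All relative cursor moves (dx, dy) with |dx|, |dy| <= max_step pixels."""
    span = range(-max_step, max_step + 1)
    return [(dx, dy) for dx in span for dy in span]


def greedy_move(q_row, moves):
    """Greedy action for one state: scans every move, as tabular Q-learning does."""
    return max(moves, key=lambda move: q_row.get(move, 0.0))


moves = pixel_moves(10)
n_relative = len(moves)      # (2 * 10 + 1) ** 2 relative moves
n_absolute = 1920 * 1080     # one action per target pixel on the assumed screen

q_row = {(3, -2): 1.0}       # hypothetical Q-values; unseen moves default to 0.0
best = greedy_move(q_row, moves)
print(n_relative, n_absolute, best)
```

The set is finite, but each greedy step scans all of it, so how practical this is depends on how coarsely the moves can be bounded.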



Answer 5:

I know this post is somewhat old, but in 2016, a variant of Q-learning applied to continuous action spaces was proposed, as an alternative to actor-critic methods. It is called normalized advantage functions (NAF). Here's the paper: Continuous Deep Q-Learning with Model-based Acceleration
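The core idea of NAF, as I understand it from the paper, is to restrict Q to the form Q(s, a) = V(s) - 1/2 (a - mu(s))^T P(s) (a - mu(s)) with P = L L^T positive definite, so the maximizing action is simply mu(s). The sketch below only evaluates that form for fixed numbers. In the actual method V, mu and L are outputs of a neural network for the current state, and the values used here are made up.

```python
def naf_q_value(state_value, mu, lower, action):
    """Q = V - 0.5 * (a - mu)^T L L^T (a - mu), with L lower-triangular."""
    diff = [a - m for a, m in zip(action, mu)]
    n = len(diff)
    # y = L^T (a - mu), so the quadratic form equals the squared norm of y
    y = [sum(lower[i][j] * diff[i] for i in range(j, n)) for j in range(n)]
    advantage = -0.5 * sum(v * v for v in y)  # never positive, zero at a = mu
    return state_value + advantage


V = 1.5                          # hypothetical state value
MU = [0.2, -0.4]                 # hypothetical greedy action for this state
L = [[1.0, 0.0], [0.5, 2.0]]     # hypothetical factor with positive diagonal

q_at_mu = naf_q_value(V, MU, L, MU)
q_elsewhere = naf_q_value(V, MU, L, [1.0, 0.0])
print(q_at_mu, q_elsewhere)
```

Since the advantage term is never positive and vanishes at a = mu, the max over actions needed for the Q-learning target is just V(s), with no search over the action space.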