What is the difference between Q-learning and SARS

2019-01-29 23:10发布

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms.

According to the book Reinforcement Learning: An Introduction (by Sutton and Barto). In the SARSA algorithm, given a policy, the corresponding action-value function Q (in the state s and action a, at timestep t), i.e. Q(st, at), can be updated as follows

Q(st, at) = Q(st, at) + α*(rt + γ*Q(st+1, at+1) - Q(st, at))

On the other hand, the update step for the Q-learning algorithm is the following

Q(st, at) = Q(st, at) + α*(rt + γ*maxa Q(st+1, a) - Q(st, at))

which can also be written as

Q(st, at) = (1 - α) * Q(st, at) + α * (rt + γ*maxa Q(st+1, a))

where γ (gamma) is the discount factor and rt is the reward received from the environment at timestep t.

Is the difference between these two algorithms the fact that SARSA only looks up the next policy value while Q-learning looks up the next maximum policy value?

5条回答
做自己的国王
2楼-- · 2019-01-29 23:50

Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning does it relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-Learning tends to converge a little slower, but has the capabilitiy to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear approximation.

In practical terms, under the ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes the difference between Q(s,a) and the weighted sum of the average action value and the maximum:

Q-Learning: Q(st+1,at+1) = maxaQ(st+1,a)

SARSA: Q(st+1,at+1) = ε·meanaQ(st+1,a) + (1-ε)·maxaQ(st+1,a)

查看更多
我想做一个坏孩纸
3楼-- · 2019-01-29 23:59

In Q-Learning

This is your: Q-Learning: Q(St,At) = Q(St,At) + a [ R(t+1) + discount * max Q(St+1,At) - Q(St,At) ]

should be changed to Q-Learning: Q(St,At) = Q(St,At) + a [ R(t+1) + discount * max Q(St+1,a) - Q(St,At) ]

As you said, you have to find the maximum Q-value for the update eq. by changing the a, Then you will have a new Q(St,At). CAREFULLY, the a that give you the maximum Q-value is not the next action. At this stage, you only know the next state (St+1), and before going to next round, you want to update the St by the St+1 (St <-- St+1).

For each loop;

  • choose At from the St using the Q-value

  • take At and observe Rt+1 and St+1

  • Update Q-value using the eq.

  • St <-- St+1

Until St is terminal

查看更多
戒情不戒烟
4楼-- · 2019-01-30 00:01

There's an index mistake in your formula for Q-Learning. Page 148 of Sutton and Barto's.

Q(st,at) <-- Q(st,at) + alpha * [r(t+1) + gamma * max Q(st+1,a) - Q(st,at) ]

The typo is in the argument of the max:

the indexes are st+1 and a, while in your question they are st+1 and at+1 (these are correct for SARSA).

Hope this helps a bit.

查看更多
干净又极端
5楼-- · 2019-01-30 00:01

What is the difference mathematically?

As is already described in most other answers, the difference between the two updates mathematically is indeed that, when updating the Q-value for a state-action pair (St, At):

  • Sarsa uses the behaviour policy (meaning, the policy used by the agent to generate experience in the environment, which is typically epsilon-greedy) to select an additional action At+1, and then uses Q(St+1, At+1) (discounted by gamma) as expected future returns in the computation of the update target.
  • Q-learning does not use the behaviour policy to select an additional action At+1. Instead, it estimates the expected future returns in the update rule as maxA Q(St+1, A). The max operator used here can be viewed as "following" the completely greedy policy. The agent is not actually following the greedy policy though; it only says, in the update rule, "suppose that I would start following the greedy policy from now on, what would my expected future returns be then?".

What does this mean intuitively?

As mentioned in other answers, the difference described above means, using technical terminology, that Sarsa is an on-policy learning algorithm, and Q-learning is an off-policy learning algorithm.

In the limit (given an infinite amount of time to generate experience and learn), and under some additional assumptions, this means that Sarsa and Q-learning converge to different solutions / "optimal" policies:

  • Sarsa will converge to a solution that is optimal under the assumption that we keep following the same policy that was used to generate the experience. This will often be a policy with some element of (rather "stupid") randomness, like epsilon-greedy, because otherwise we are unable to guarantee that we'll converge to anything at all.
  • Q-Learning will converge to a solution that is optimal under the assumption that, after generating experience and training, we switch over to the greedy policy.

When to use which algorithm?

An algorithm like Sarsa is typically preferable in situations where we care about the agent's performance during the process of learning / generating experience. Consider, for example, that the agent is an expensive robot that will break if it falls down a cliff. We'd rather not have it fall down too often during the learning process, because it is expensive. Therefore, we care about its performance during the learning process. However, we also know that we need it to act randomly sometimes (e.g. epsilon-greedy). This means that it is highly dangerous for the robot to be walking alongside the cliff, because it may decide to act randomly (with probability epsilon) and fall down. So, we'd prefer it to quickly learn that it's dangerous to be close to the cliff; even if a greedy policy would be able to walk right alongside it without falling, we know that we're following an epsilon-greedy policy with randomness, and we care about optimizing our performance given that we know that we'll be stupid sometimes. This is a situation where Sarsa would be preferable.

An algorithm like Q-learning would be preferable in situations where we do not care about the agent's performance during the training process, but we just want it to learn an optimal greedy policy that we'll switch to eventually. Consider, for example, that we play a few practice games (where we don't mind losing due to randomness sometimes), and afterwards play an important tournament (where we'll stop learning and switch over from epsilon-greedy to the greedy policy). This is where Q-learning would be better.

查看更多
The star\"
6楼-- · 2019-01-30 00:07

When I was learning this part, I found it very confusing too, so I put together the two pseudo-codes from R.Sutton and A.G.Barto hoping to make the difference clearer.

enter image description here

Blue boxes highlight the part where the two algorithms actually differ. Numbers highlight the more detailed difference to be explained later.

TL;NR:

|             | SARSA | Q-learning |
|:-----------:|:-----:|:----------:|
| Choosing A' |   π   |      π     |
| Updating Q  |   π   |      μ     |

where π is a ε-greedy policy (e.g. ε > 0 with exploration), and μ is a greedy policy (e.g. ε == 0, NO exploration).

  1. Given that Q-learning is using different policies for choosing next action A' and updating Q. In other words, it is trying to evaluate π while following another policy μ, so it's an off-policy algorithm.

  2. In contrast, SARSA uses π all the time, hence it is an on-policy algorithm.

More detailed explanation:

  1. The most important difference between the two is how Q is updated after each action. SARSA uses the Q' following a ε-greedy policy exactly, as A' is drawn from it. In contrast, Q-learning uses the maximum Q' over all possible actions for the next step. This makes it look like following a greedy policy with ε=0, i.e. NO exploration in this part.

  2. However, when actually taking an action, Q-learning still uses the action taken from a ε-greedy policy. This is why "Choose A ..." is inside the repeat loop.

  3. Following the loop logic in Q-learning, A' is still from the ε-greedy policy.

查看更多
登录 后发表回答