Are Q-learning and SARSA with greedy selection equ

The difference between Q-learning and SARSA is that Q-learning compares the current state and the best possible next state, whereas SARSA compares the current state against the actual next state.

If a greedy selection policy is used, that is, the action with the highest action value is selected 100% of the time, are SARSA and Q-learning then identical?

标签： reinforcement-learning q-learning sarsa

3条回答

闹够了就滚

2楼-- · 2020-06-16 06:34

Well, not actually. A key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm (it follows the policy that is learning) and Q-learning is an off-policy algorithm (it can follow any policy (that fulfills some convergence requirements).

Notice that in the following pseudocode of both algorithms, that SARSA choose a' and s' and then updates the Q-function; while Q-learning first updates the Q-function, and the next action to perform is selected in the next iteration, derived from the updated Q-function and not necessarily equal to the a' selected to update Q.

In any case, both algorithms require exploration (i.e., taking actions different from the greedy action) to converge.

The pseudocode of SARSA and Q-learning have been extracted from Sutton and Barto's book: Reinforcement Learning: An Introduction (HTML version)

0人赞添加讨论(0) 举报

贪生不怕死

3楼-- · 2020-06-16 06:38

If an optimal policy has already formed, SARSA with pure greedy and Q-learning are same.

However, in training, we only have a policy or sub-optimal policy, SARSA with pure greedy will only converge to the "best" sub-optimal policy available without trying to explore the optimal one, while Q-learning will do, because of , which means it tries all actions available and choose the max one.

0人赞添加讨论(0) 举报

对你真心纯属浪费

4楼-- · 2020-06-16 06:39

If we use only the greedy policy then there will be no exploration so the learning will not work. In the limiting case where epsilon goes to 0 (like 1/t for example), then SARSA and Q-Learning would converge to the optimal policy q*. However with epsilon being fixed, SARSA will converge to the optimal epsilon-greedy policy while Q-Learning will converge to the optimal policy q*.

I write a small note here to explain the differences between the two and hope it can help:

https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html

0人赞添加讨论(0) 举报

Are Q-learning and SARSA with greedy selection equ

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间