I've seen definitions such as:
A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.
But I still don't fully understand it. What exactly is a policy in reinforcement learning?
The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: a policy is an agent's strategy.
For example, imagine a world where a robot moves across the room and the task is to get to the target point (x, y), where it gets a reward. Here:
A policy is what an agent does to accomplish this task:
Obviously, some policies are better than others, and there are multiple ways to assess them, namely the state-value function and the action-value function. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context "time" is better understood as "state"):
A policy defines the learning agent's way of behaving at a given time.
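To make the robot example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and not part of the original example: the 5x5 grid, the start cell (0, 0), the target (4, 4), the four move actions, and the names random_policy, directed_policy and steps_to_target. It runs one policy that wanders at random and one that heads straight for the target, and counts how many steps each needs. Step count is only a crude stand-in for the value functions mentioned above.

```python
import random

# Hypothetical 5x5 room; the robot starts at (0, 0) and the target is (4, 4).
SIZE = 5
TARGET = (4, 4)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(state, action):
    """Move one cell, staying inside the room walls."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)
    y = min(max(state[1] + dy, 0), SIZE - 1)
    return (x, y)


def random_policy(state, rng):
    """Wander: ignore the state and pick any action uniformly."""
    return rng.choice(list(ACTIONS))


def directed_policy(state, rng):
    """Head for the target: fix x first, then y."""
    if state[0] < TARGET[0]:
        return "right"
    if state[0] > TARGET[0]:
        return "left"
    return "up" if state[1] < TARGET[1] else "down"


def steps_to_target(policy, rng, start=(0, 0), max_steps=1000):
    """Run one episode and return how many steps the policy needed."""
    state = start
    for t in range(max_steps):
        if state == TARGET:
            return t
        state = step(state, policy(state, rng))
    return max_steps  # gave up without reaching the target


rng = random.Random(0)
directed_steps = steps_to_target(directed_policy, rng)
# Average over 200 episodes, since the random policy varies from run to run.
random_steps = sum(steps_to_target(random_policy, rng) for _ in range(200)) / 200
print(directed_steps, random_steps)
```

In this setup the directed policy needs 8 steps, which is the Manhattan distance from start to target, and the random policy can never need fewer. Both are valid policies: each one maps a state to an action. They just differ in quality.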
Formally
More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where:
S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix (probability of ending up in a state for each current state and each action)
R is a reward function, given a state and an action
γ is a discount factor, between 0 and 1
Then, a policy π is a probability distribution over actions given states. That is, the likelihood of every action when an agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your definition.
I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.
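As a rough illustration of the formal definition, the tuple and a policy over it can be written down as plain Python data. This is only a sketch under stated assumptions: the MDP class, the two states s0 and s1, the actions stay and move, and all the probabilities and rewards are made up for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list       # S: finite set of states
    actions: list      # A: finite set of actions
    transitions: dict  # P: (state, action) -> {next_state: probability}
    rewards: dict      # R: (state, action) -> reward
    gamma: float       # discount factor, between 0 and 1


# A made-up two-state MDP, purely for illustration.
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transitions={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.8, "s1": 0.2},
    },
    rewards={
        ("s0", "stay"): 0.0, ("s0", "move"): 0.0,
        ("s1", "stay"): 1.0, ("s1", "move"): 0.0,
    },
    gamma=0.9,
)

# The policy: pi[s][a] is the probability of taking action a in state s.
# Each row is a distribution over actions, so it must sum to 1.
pi = {
    "s0": {"stay": 0.1, "move": 0.9},
    "s1": {"stay": 0.7, "move": 0.3},
}


def sample_action(policy, state, rng):
    """Draw one action according to the policy's probabilities for this state."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(sample_action(pi, "s0", random.Random(0)))
```

Note that the policy is not part of the MDP tuple. The MDP describes the problem, and the policy is one candidate way of acting in it.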
Here is a succinct answer: a policy is the 'thinking' of the agent. It answers the question: when you are in some state s, which action a should the agent take now? You can think of policies as a lookup table:
If you are in state 1, you'd (assuming a greedy strategy) pick action 1. If you are in state 2, you'd pick action 2.
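The lookup-table view can be sketched in a few lines of Python. The table values here are invented for illustration (they are not taken from the answer), and q_table and greedy_policy are hypothetical names. The table holds one score per (state, action) pair, and the greedy strategy picks the highest-scoring action in the current state.

```python
# Hypothetical action values: q_table[state][action]. The numbers are made up.
q_table = {
    1: {1: 0.9, 2: 0.1},
    2: {1: 0.2, 2: 0.8},
}


def greedy_policy(state):
    """Look up the row for this state and return the best-scoring action."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


for state in q_table:
    print(f"state {state} -> action {greedy_policy(state)}")
```

With these numbers the agent picks action 1 in state 1 and action 2 in state 2, matching the description above.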
In plain words, in the simplest case, a policy π is a function that takes as input a state s and returns an action a. That is: π(s) → a
In this way, the policy is typically used by the agent to decide what action a should be performed when it is in a given state s.
Sometimes, the policy can be stochastic instead of deterministic. In such a case, instead of returning a unique action a, the policy returns a probability distribution over a set of actions.
In general, the goal of any RL algorithm is to learn an optimal policy that achieves a specific goal.
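The difference between the two kinds of policy shows up in what the function returns. Below is a minimal Python sketch, assuming integer states, two actions called left and right, and an arbitrary threshold of 3. All of these are made up for illustration. It also shows that a deterministic policy can be viewed as a stochastic one that puts all the probability on a single action.

```python
ACTIONS = ["left", "right"]


def deterministic_policy(state):
    """pi(s) -> a: exactly one action per state."""
    return "right" if state < 3 else "left"


def stochastic_policy(state):
    """pi(. | s): a probability for every action in this state."""
    p_right = 0.8 if state < 3 else 0.2
    return {"left": 1.0 - p_right, "right": p_right}


def as_distribution(policy, state):
    """View a deterministic policy as a one-hot distribution over actions."""
    chosen = policy(state)
    return {a: (1.0 if a == chosen else 0.0) for a in ACTIONS}


print(deterministic_policy(0))                   # a single action
print(stochastic_policy(0))                      # a distribution over actions
print(as_distribution(deterministic_policy, 0))  # all mass on one action
```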