In reinforcement learning, what is the difference between policy iteration and value iteration?
As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy.
My doubt is this: if you are selecting a random policy π in PI, how is it guaranteed to be the optimal policy, even if we choose several random policies?
In policy iteration algorithms, you start with a random policy, then find the value function of that policy (policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.
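The evaluate-then-improve loop described above can be sketched on a tiny MDP. Everything below (the two-state environment, the transition probabilities `P`, and the rewards `R`) is a made-up illustrative example, not something from the question:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only):
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_evaluation(pi):
    """Solve the linear Bellman system V = R_pi + gamma * P_pi @ V exactly."""
    n = P.shape[0]
    P_pi = P[np.arange(n), pi]   # transition matrix under policy pi
    R_pi = R[np.arange(n), pi]   # expected reward under policy pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def policy_iteration():
    pi = np.zeros(P.shape[0], dtype=int)   # start with an arbitrary policy
    while True:
        V = policy_evaluation(pi)          # policy evaluation step
        Q = R + gamma * P @ V              # one-step lookahead Q-values
        new_pi = np.argmax(Q, axis=1)      # greedy policy improvement step
        if np.array_equal(new_pi, pi):     # policy stable => optimal
            return pi, V
        pi = new_pi

pi_star, V_star = policy_iteration()
```

Because each improvement step is strictly better (or leaves the policy unchanged) and there are finitely many deterministic policies, the loop must terminate at an optimal policy.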
In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can easily derive the optimal policy from the optimal value function. This process is based on the optimality Bellman operator.
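The same idea can be sketched in code: repeatedly apply the optimality Bellman backup to the value function, then read off the greedy policy at the end. The 2-state MDP below is a hypothetical example of my own, not from the question:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only):
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(tol=1e-10):
    V = np.zeros(P.shape[0])   # arbitrary initial value function
    while True:
        # Optimality Bellman backup:
        # V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        V_new = np.max(R + gamma * P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # The optimal policy is greedy with respect to the optimal value function.
    pi = np.argmax(R + gamma * P @ V, axis=1)
    return pi, V

pi_star, V_star = value_iteration()
```

Note that, unlike policy iteration, no explicit policy is maintained during the loop; the `max` over actions plays the role of the improvement step.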
In some sense, both algorithms share the same working principle, and they can be seen as two cases of generalized policy iteration. However, the optimality Bellman operator contains a max operator, which is nonlinear and therefore has different properties. In addition, it is possible to use hybrid methods between pure value iteration and pure policy iteration.
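One such hybrid is modified policy iteration: instead of solving policy evaluation exactly, you run only `k` approximate evaluation sweeps between improvement steps. This is a sketch under assumed conditions, using a hypothetical 2-state MDP whose numbers are my own illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def modified_policy_iteration(k=5, iters=200):
    n = P.shape[0]
    V = np.zeros(n)
    for _ in range(iters):
        # Greedy improvement with respect to the current value estimate.
        pi = np.argmax(R + gamma * P @ V, axis=1)
        # k approximate evaluation sweeps instead of an exact linear solve.
        for _ in range(k):
            V = R[np.arange(n), pi] + gamma * P[np.arange(n), pi] @ V
    return np.argmax(R + gamma * P @ V, axis=1), V

pi_star, V_star = modified_policy_iteration()
```

With `k=1` this reduces to value iteration, and as `k` grows it approaches exact policy iteration, which makes the "two cases of generalized policy iteration" view concrete.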
Contrary to @zyxue 's idea, in my experience VI is generally much faster than PI.
The reason is straightforward: as you already know, the Bellman equation is used to solve the value function for a given policy. Since we can solve for the value function of the optimal policy directly, solving the value function for the current policy is obviously a waste of time.
As for your question about the convergence of PI, I think you might be overlooking the fact that if you improve the strategy for each information state, then you improve the strategy for the whole game. This is also easy to prove if you are familiar with Counterfactual Regret Minimization: the sum of the regrets over the individual information states forms an upper bound on the overall regret, so minimizing the regret at each state minimizes the overall regret, which leads to the optimal policy.
Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book: Reinforcement Learning: An Introduction.
Key points:
In my experience, policy iteration is faster than value iteration, as a policy converges more quickly than a value function. I remember this is also described in the book.
I guess the confusion mainly comes from all these somewhat similar terms, which also confused me before.