
UNDERSTANDING POLICY ITERATION IN REINFORCEMENT LEARNING

Updated: May 2

A policy is the strategy a reinforcement learning agent uses to decide which action to take in its current state. A policy takes a state as input and returns the action to be performed. The goal of a reinforcement learning agent is to maximise the cumulative reward it receives, and to achieve this goal it must find the optimal policy, the one that yields the highest cumulative reward. Policy iteration is a two-part algorithm for finding this optimal policy. It comprises a policy evaluation step and a policy improvement step, which are used to evaluate each candidate policy and modify it until the optimal policy is found.
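
Concretely, a deterministic policy can be pictured as a lookup table from states to actions. The tiny sketch below is a made-up illustration of that idea; the state and action names are assumptions, not part of any specific environment discussed here.

```python
# A rough illustration (not from the article): a deterministic policy
# represented as a simple lookup from states to actions.

policy = {
    "s0": "right",
    "s1": "right",
    "s2": "down",
    "s3": "stay",   # e.g., a terminal state
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(act("s1"))   # -> "right"
```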



Policy Evaluation

Policy evaluation is the first step in policy iteration. Here, we calculate the state-value function, also known as the V-function, using the Bellman equation for all states in the state space. The state-value function estimates the expected return of being in a particular state and following the given policy, helping us evaluate how good it is to follow that policy from each state.

The algorithm starts with a random policy, π, and initializes the state-value function, V(s), to zero for all states. It then saves the current value of V(s) as v_old and updates V(s) according to the Bellman equation. The difference between the updated V(s) and v_old is computed. If the difference is sufficiently small, the values have converged; otherwise, the steps are repeated until they do. As a result of this step, we know the expected return of following the policy π from each state.

1. Initialize:

   - Set V(s) = 0 for all states s.

   - Set a small threshold ε (for determining convergence).

   - Initialize δ to a large number (to enter the loop).

2. Loop Until Convergence:

   - Set δ to 0 at the start of each iteration (to capture the maximum change).

   - For each state s in the state space S:

      - Save the current value of V(s) to v_old.

      - Update V(s) using the Bellman equation (written here for a deterministic policy π):

        V(s) ← Σ_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V(s') ]

      - Calculate the absolute change |v_old − V(s)|.

      - Update δ to the largest change seen so far:

        δ ← max(δ, |v_old − V(s)|)

   - Check if δ < ε. If it is, break out of the loop; otherwise, continue.

3. Conclusion:

   - Once δ < ε, the value function V(s) for the given policy π has converged to a stable set of values.

This algorithm ensures that all state values have sufficiently stabilized according to the policy π before moving on, making it a robust method for policy evaluation in reinforcement learning.
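
To make the steps above concrete, here is a minimal Python sketch of policy evaluation on a toy MDP. The transition-table format, state and action names, discount factor, and threshold are illustrative assumptions chosen for the example, not values from any particular environment.

```python
# A minimal sketch of iterative policy evaluation on a toy MDP, assuming a
# transition table of the form:
#   transitions[state][action] = [(probability, next_state, reward), ...]

GAMMA = 0.9      # discount factor
EPSILON = 1e-6   # convergence threshold ε

transitions = {
    "s0": {"right": [(1.0, "s1", 0.0)],
           "jump":  [(0.5, "s2", 0.0), (0.5, "s0", 0.0)]},
    "s1": {"right": [(1.0, "s2", 0.0)]},
    "s2": {"down":  [(1.0, "s3", 1.0)]},
    "s3": {"stay":  [(1.0, "s3", 0.0)]},   # terminal state loops onto itself
}

# A deterministic starting policy: one action per state.
policy = {"s0": "right", "s1": "right", "s2": "down", "s3": "stay"}

def evaluate_policy(policy, transitions, gamma=GAMMA, epsilon=EPSILON):
    """Return V(s) for every state under the given deterministic policy."""
    V = {s: 0.0 for s in transitions}        # 1. Initialize V(s) = 0 for all states
    while True:                              # 2. Loop until convergence
        delta = 0.0                          #    reset δ to capture the maximum change
        for s in transitions:
            v_old = V[s]                     #    save the current value of V(s)
            a = policy[s]                    #    action prescribed by the policy
            # Bellman update: expected reward plus discounted value of successors.
            V[s] = sum(p * (r + gamma * V[s_next])
                       for p, s_next, r in transitions[s][a])
            delta = max(delta, abs(v_old - V[s]))
        if delta < epsilon:                  # 3. Values have stabilized
            return V

print(evaluate_policy(policy, transitions))
```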


Policy Improvement

Policy improvement follows policy evaluation. In this step, we calculate the action-value function, also known as the Q-function, for each action in each state, using the state-value function obtained from policy evaluation. The action-value function estimates the expected return of performing an action in a state and then following the policy. We compute this value for all actions in all states, identify the action that produces the maximum return in each state, and modify the current policy, π_k, to choose that action. The improved policy, π_{k+1}, is then tested again using the policy evaluation step. We repeat this process until the policy converges, i.e., further improvements no longer change the policy.

1. Initialize:

  - Assume that you start with some policy π that is an arbitrary mapping from states to actions.

  - Initialize policy_is_stable to True to check for convergence.

2. Loop Over All States:

  - For each state s in the state space S:

    - Save the current action prescribed by the policy for state s: a_old = π(s) (where π(s) is the action the policy prescribes for state s).

    - Find the best action using the action-value function with argmax, and assign it as the new prescribed action:

      π(s) ← argmax_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]

    - Check if the action has changed:

      - If a_old ≠ π(s), set policy_is_stable to False.

3. Convergence Check:

  - If policy_is_stable is True, then the policy has converged, and you can end the improvement loop.

  - If policy_is_stable is False, repeat the loop with the improved policy, updating the value function V(s) as necessary before the next iteration of policy improvement.
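
Continuing the same toy example, here is a minimal Python sketch of the policy improvement step. It reuses the `transitions` table and the `evaluate_policy()` helper from the policy evaluation sketch above; both are illustrative assumptions rather than code from a specific library.

```python
# A minimal sketch of policy improvement: make the policy greedy with respect
# to the evaluated state-value function V(s).

def improve_policy(policy, V, transitions, gamma=0.9):
    """Return a greedy policy w.r.t. V plus a flag saying whether it changed."""
    policy_is_stable = True
    new_policy = dict(policy)
    for s in transitions:
        a_old = policy[s]                    # action currently prescribed for s
        # Q(s, a) for every available action, computed from the evaluated V(s).
        q_values = {
            a: sum(p * (r + gamma * V[s_next])
                   for p, s_next, r in transitions[s][a])
            for a in transitions[s]
        }
        new_policy[s] = max(q_values, key=q_values.get)   # greedy (argmax) action
        if new_policy[s] != a_old:
            policy_is_stable = False         # the policy changed in this state
    return new_policy, policy_is_stable

# Example: one round of improvement after evaluating the starting policy.
V = evaluate_policy(policy, transitions)
policy, stable = improve_policy(policy, V, transitions)
print(policy, stable)
```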

 

In policy iteration, we repeat policy evaluation and policy improvement until we find the optimal value function and policy. Each iteration produces a new policy that is at least as good as the previous one; once the policy stops changing, it is optimal.
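
Putting the two sketches together, a full policy iteration loop might look like the following; again, the toy MDP and the helper functions are assumptions made for illustration.

```python
# A minimal sketch of the full policy iteration loop, alternating the two
# helpers from the sketches above until the policy stops changing.

def policy_iteration(initial_policy, transitions, gamma=0.9):
    policy = dict(initial_policy)
    while True:
        V = evaluate_policy(policy, transitions, gamma=gamma)                 # evaluation
        policy, stable = improve_policy(policy, V, transitions, gamma=gamma)  # improvement
        if stable:                  # no action changed, so the policy is optimal
            return policy, V

optimal_policy, optimal_V = policy_iteration(policy, transitions)
print(optimal_policy)
```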
