
UNDERSTANDING VALUE ITERATION IN REINFORCEMENT LEARNING

In my previous post, we examined policy iteration, a two-step algorithm. The first step, policy evaluation, computes the state-value function V(s) and iterates until the values converge. The second step, policy improvement, computes the action-value function Q(s,a) from those values, iterates until it converges as well, and then updates the policy based on Q(s,a). We repeat from policy evaluation until further improvements no longer change the policy. This process can be computationally expensive, which motivated value iteration, a more efficient variant of policy iteration.



The Need for Convergence in Policy and Value Iteration

Understanding the need for convergence in V(s) and Q(s,a) can be challenging. The state-value function V(s) estimates the expected return when starting from a particular state and following the policy, while the action-value function Q(s,a) estimates the expected return from taking a given action in a given state. These values are not known exactly; they are estimates built from the transition probabilities via the Bellman equation, and they can change with every iteration. We therefore measure δ, the largest change in V(s) between the current iteration and the previous one, and repeat the iteration until this difference falls below our threshold ε. Similarly, we compute the difference for Q(s,a) and iterate until it converges.
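
As a rough Python sketch of that stopping rule (the bellman_update argument is a hypothetical helper that performs one backup for a single state, and the epsilon default is only an illustrative choice, not a value from this post):

def sweep_until_converged(V, states, bellman_update, epsilon=1e-6):
    """Repeat full sweeps over the states until the largest change (delta) drops below epsilon."""
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = bellman_update(s, V)             # one Bellman backup for state s
            delta = max(delta, abs(V[s] - v_old))   # track the largest change in this sweep
        if delta < epsilon:                         # the estimates have effectively converged
            return V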


Value Iteration

Value iteration combines the policy evaluation and policy improvement steps into a single process. Instead of iterating separately until V(s) converges and then until Q(s,a) converges, we only need to iterate until V_{k+1}(s) converges. The updated equation captures both V(s) and Q(s,a) in one formula. Once the values converge, the policy is extracted by choosing, in each state, the action with the highest expected return.

The Equation:
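
In standard notation, the value-iteration update is the Bellman optimality backup, which takes the maximum over actions of the expected one-step return:

V_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V_k(s') \,\bigr]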



Pseudocode:
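
A minimal Python sketch of this pseudocode, assuming the dynamics are available as a transition model P where P[s][a] is a list of (probability, next_state, reward) tuples; this layout, and the gamma and theta values, are illustrative assumptions rather than details from the post:

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Fold policy evaluation and policy improvement into one update per state."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Q(s, a) for every action, computed from the current value estimates
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)                          # improvement folded in: keep the best action's value
            delta = max(delta, abs(best - V[s]))
            V[s] = best                            # evaluation: update the estimate for this state
        if delta < theta:                          # stop once the values have converged
            break
    # Extract the greedy policy from the converged value function
    policy = [int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy

With a small tabular environment such as a grid world, the returned policy simply maps each state index to its greedy action.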



As you can see, value iteration integrates policy evaluation (calculating expected returns) and policy improvement (selecting the best action based on those returns) into a single step, which makes the overall process simpler and faster.
