In previous blogs, we explored various classifications in reinforcement learning (RL), such as model-free vs. model-based methods and dynamic programming (DP) vs. Monte Carlo (MC) vs. temporal-difference (TD) methods. Today, we'll delve into another key categorization: on-policy vs. off-policy learning, highlighting example algorithms that epitomize each category.
On-Policy Learning: SARSA
SARSA, an acronym for State-Action-Reward-State-Action, is an on-policy, model-free algorithm within the Temporal Difference (TD) methods category. This algorithm learns directly from the actions taken by the policy currently in use, refining the same policy as it gathers more data.
How SARSA Works:
Q-Function: SARSA estimates Q-values, which represent the expected utility of taking a given action in a given state and following the current policy thereafter. These Q-values are stored in a Q-table, where rows represent states and columns represent actions. Below is a visual representation of a Q-table for four states (C1 to C4) and two actions (left and right) in each state:
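As a minimal illustration of that layout (using the state and action names above; the zero entries simply reflect a table before any learning), such a Q-table can be initialized as a 4×2 array:

```python
import numpy as np

# Rows are states C1..C4, columns are actions (left, right).
states = ["C1", "C2", "C3", "C4"]
actions = ["left", "right"]

# All Q-values start at zero before any learning has happened.
Q = np.zeros((len(states), len(actions)))
print(Q)
# [[0. 0.]
#  [0. 0.]
#  [0. 0.]
#  [0. 0.]]
```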
Update Rule: The SARSA update rule is applied at each time step and is formulated as:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
Here, $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$, and $Q(s_{t+1}, a_{t+1})$ represents the value of the next state-action pair, with $a_{t+1}$ chosen by the current policy.
Pseudocode for SARSA
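A minimal Python sketch of tabular SARSA with an ε-greedy policy is shown below; the Gymnasium-style environment interface (reset/step returning termination flags) and the hyperparameter values are assumptions made for illustration:

```python
import numpy as np

def sarsa(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control; assumes a Gymnasium-style env with
    discrete observation and action spaces."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def epsilon_greedy(state):
        # Behavior and target policy are the same: epsilon-greedy w.r.t. Q.
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state, _ = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = epsilon_greedy(next_state)
            # SARSA target uses the action actually selected by the policy.
            td_target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

Note that the same ε-greedy policy both generates behavior and appears in the update target, which is exactly what makes SARSA on-policy.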
Off-Policy Learning: Q-Learning
Q-Learning is another model-free TD method, but it diverges from SARSA by learning about the optimal policy regardless of the policy the agent actually follows. It uses a behavior policy for exploration while learning the values of a separate, potentially different, target policy: the greedy (optimal) one.
How Q-Learning Works:
Update Rule: The distinctive feature of Q-learning is its update rule, which is independent of the policy being followed:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
Unlike SARSA, Q-learning uses the maximum Q-value for the next state across all possible actions, reflecting the optimal return achievable from the next state.
Visualization of Q-table: As with SARSA, Q-learning maintains a Q-table. However, the updates to this table are driven by the maximization step, which pushes the learned values toward the optimal action values over time, provided every state-action pair continues to be explored.
Pseudocode for Q-Learning
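A corresponding Python sketch of tabular Q-Learning, under the same assumed environment interface and illustrative hyperparameters, might look like this:

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control; same assumed Gymnasium-style interface
    as the SARSA sketch."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy exploration over current Q-values.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Target policy: greedy -- the max over next actions, regardless
            # of which action the behavior policy will actually take next.
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q
```

The only substantive change from the SARSA sketch is the target: the max over next-state actions replaces the Q-value of the action the policy actually takes next.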
Comparing SARSA and Q-Learning
While SARSA directly incorporates the effect of the policy's action-selection strategy into its updates (since it uses the actual next action selected by the policy), Q-Learning abstracts away from the current policy and always evaluates the best possible action from each state. This fundamental difference can lead to different behaviors under similar conditions, with Q-Learning typically being more aggressive in pursuing the optimal policy. A classic illustration is the cliff-walking gridworld from Sutton and Barto: an ε-greedy SARSA agent learns a safer path away from the cliff edge, because its targets account for occasional exploratory missteps, whereas Q-Learning learns the shorter path right along the edge.
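To make that contrast concrete, here is a toy one-step comparison of the two targets with made-up numbers; only the relative sizes matter:

```python
import numpy as np

# Illustrative values only: Q-values for the next state, the exploratory
# action the policy happened to pick, and a one-step reward.
Q_next = np.array([0.2, 0.8])   # Q(s_{t+1}, left), Q(s_{t+1}, right)
next_action = 0                 # the action actually chosen by the policy
reward, gamma = 1.0, 0.9

sarsa_target = reward + gamma * Q_next[next_action]   # 1.18: follows the policy
q_learning_target = reward + gamma * np.max(Q_next)   # 1.72: assumes the greedy action
```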
Conclusion
Understanding the nuances between on-policy and off-policy learning is crucial for effectively deploying RL algorithms. SARSA and Q-Learning provide foundational frameworks for exploring these concepts, each with its strengths and suitable applications. Whether to choose SARSA or Q-Learning often depends on the specific requirements of the environment and the desired balance between exploration and exploitation.