In a reinforcement learning system, there is an agent that takes actions in an environment and receives feedback in the form of rewards. More specifically, the agent perceives the current condition of the environment, known as the state, and performs actions based on a policy. This policy is the strategy used to decide on the action. In return for the actions taken, the environment provides feedback in the form of rewards—positive or negative—depending on whether the action was desirable or undesirable for reaching the goal state. Therefore, the reinforcement learning system is said to be goal-oriented, aiming to reach the goal state or end state with maximum cumulative reward. The end state, also called the terminal state, is the final state, where the agent can't take any more actions. Not all terminal states are goal states, but all goal states are terminal states.

To understand the cumulative reward system, we need to learn about the trajectory, or rollout. A trajectory is the path taken by the agent to reach the terminal state. In other words, it is the sequence of states the agent went through and the actions performed to reach the terminal state. For each action the agent performs, it receives a reward; how often rewards arrive depends on the reward type.
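A trajectory can be represented as a plain sequence of (state, action, reward) tuples. Here is a minimal sketch; the state names, actions, and reward values are made up for illustration:

```python
# A trajectory (rollout): the sequence of states visited, actions taken,
# and rewards received until the terminal state is reached.
# All states, actions, and rewards below are hypothetical placeholders.
trajectory = [
    ("s0", "right", 0.0),
    ("s1", "right", 0.0),
    ("s2", "up",    1.0),  # terminal state reached with reward +1
]

states  = [step[0] for step in trajectory]
actions = [step[1] for step in trajectory]
rewards = [step[2] for step in trajectory]
print(states)   # ['s0', 's1', 's2']
print(rewards)  # [0.0, 0.0, 1.0]
```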

There are two types of rewards: sparse and dense. A sparse reward is an infrequent reward received only upon reaching the terminal state or other significant states, whereas a dense reward is a continuous reward received for every action the agent takes.
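As an illustration, consider a made-up one-dimensional corridor where the agent must reach position 10. A sparse reward pays out only at the goal, while a dense reward gives feedback on every step:

```python
GOAL = 10  # hypothetical goal position in a 1-D corridor

def sparse_reward(position: int) -> float:
    # Sparse: reward only when the goal (terminal) state is reached.
    return 1.0 if position == GOAL else 0.0

def dense_reward(position: int) -> float:
    # Dense: reward on every step; closer to the goal means a higher reward.
    return -abs(GOAL - position)

print(sparse_reward(3), sparse_reward(10))  # 0.0 1.0
print(dense_reward(3), dense_reward(10))    # -7 0
```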

The sum of rewards received by the agent along a trajectory is called the return, or cumulative reward. It adds up all the rewards the agent receives while following a path under a policy. When calculating the return, we can apply a hyperparameter called the discount factor. This hyperparameter is a multiplier in the return function that reduces the value of future rewards exponentially. Because future rewards are worth less, a trajectory that takes a longer route to the goal state yields a smaller return, so the agent learns to take the shortest route, which yields the highest return. In continuous tasks, the discount factor also helps ensure that the value calculations converge and that the agent does not get stuck in loops with no end: it discourages policies that keep the agent cycling through states without making significant progress.

If the discount factor is 0, only the reward at the current time step t counts toward the return; if it is 1, rewards at all time steps are weighted equally, so the agent may still reach the goal, but possibly via a longer route. Choosing a discount factor less than 1 encourages the agent to take the fastest path to the goal, reaching it in fewer time steps. Here, a time step is a counter incremented with each action: if the agent performs action a_t at time step t, it moves to the next time step t+1, then t+2, t+3, and so on.
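The discounted return described above is G = r_0 + γ·r_1 + γ²·r_2 + …, where γ is the discount factor. A short function makes the effect concrete; the reward sequences below are made up for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one trajectory."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A short route to the goal keeps more of the final reward...
print(discounted_return([0, 0, 10], gamma=0.9))        # ~8.1
# ...than a longer route to the same goal.
print(discounted_return([0, 0, 0, 0, 10], gamma=0.9))  # ~6.56
# gamma=0: only the immediate reward counts; gamma=1: all rewards count equally.
print(discounted_return([1, 2, 3], gamma=0.0))  # 1.0
print(discounted_return([1, 2, 3], gamma=1.0))  # 6.0
```

The short trajectory scores higher than the long one even though both end with the same reward, which is exactly why a discount factor below 1 pushes the agent toward the fastest route.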

The policy is adjusted based on the return. The agent iteratively follows trajectories under different policies and calculates the return for each. Finally, it chooses the policy that gave the highest return; this policy is called the optimal policy.

This learning process is adopted by many algorithms, such as Monte Carlo methods. However, this approach is computationally inefficient and slow to learn, because the agent must iteratively follow trajectories under different policies before finding the optimal one. To solve this problem, value functions were introduced. A value function estimates the expected return—the predicted sum of all future rewards. Before following a trajectory, the value function can estimate the cumulative reward the agent would receive if it followed that path. There are two types of value functions: the state value function and the action value function. The state value function, also called the V-function, estimates the expected return if the agent starts in a particular state. The action value function, also called the Q-function, estimates the expected return if the agent starts in a particular state and takes a particular action. Expected returns are used in planning and decision-making, allowing the agent to evaluate different policies without having to execute them, which reduces both the computation required and the learning time. Value functions are central to many reinforcement learning algorithms, such as Q-learning, SARSA, and policy gradient methods.
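As a minimal sketch of how an action value function can be learned without enumerating whole trajectories for every policy, here is tabular Q-learning on a hypothetical one-dimensional corridor. The states, actions, and hyperparameters are all made up for illustration:

```python
import random

# Hypothetical 1-D corridor: states 0..4, with the goal (terminal) state at 4.
N_STATES, GOAL = 5, 4
ACTIONS = [+1, -1]                 # move right or left
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q-function as a table: Q[s][a] estimates the expected return of taking
# action a in state s and following the greedy policy afterwards.
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}

random.seed(0)
for _ in range(500):               # training episodes
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit current Q estimates, sometimes explore.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[s][x])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0      # sparse reward at the goal
        # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next].values()) - Q[s][a])
        s = s_next

# The learned greedy action in every non-terminal state should be +1 (right).
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1]
```

Note that the agent never compares full trajectories explicitly: each update improves the Q estimates from a single (state, action, reward, next state) step, and the greedy policy is read off the table afterwards.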
