{{Short description|Model-free reinforcement learning algorithm}}
{{Machine learning|Reinforcement learning}}
{{More citations needed|date=October 2022|reason=Developments may have been published since 2017}}


'''Proximal policy optimization''' ('''PPO''') is a [[reinforcement learning]] (RL) algorithm for training an [[Artificial intelligence|intelligent agent]]'s decision function to accomplish difficult tasks. PPO was developed by John Schulman in 2017,<ref name=":0">J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv.org, https://arxiv.org/abs/1707.06347 , arXiv:1707.06347 [cs.LG]</ref> and became the default RL algorithm at the US artificial intelligence company [[OpenAI]].<ref>OpenAI, "Proximal Policy Optimization," Available at: https://openai.com/research/openai-baselines-ppo</ref> Since 2018, PPO has seen success in a wide variety of applications, such as [[Robot control|controlling a robotic arm]], beating professional players at [[Dota 2]], and excelling in Atari games.<ref name=":1">Arxiv Insights. "An introduction to Policy Gradient methods," ''YouTube'', Oct 1st, 2018 [Video file]. Available: https://www.youtube.com/watch?v=5P7I-xPq8u8</ref> Many experts have called PPO the state of the art, due to its ability to strike a balance between performance and comprehension. Compared with other algorithms, the three main advantages of PPO are simplicity, stability, and sample efficiency.<ref name=":5">T. Simonini, "Proximal Policy Optimization (PPO)," Hugging Face – The AI community building the future., https://huggingface.co/blog/deep-rl-ppo</ref>


PPO is classified as a [[Reinforcement learning#Direct policy search|policy gradient method]] for training an agent's policy [[Neural network (machine learning)|network]]. This network approximates a ''policy function'', used by the agent to make decisions. Essentially, to train the policy function, PPO takes a small policy update step, so the agent can reliably reach the optimal solution. A step size that is too big may direct the policy in a suboptimal direction, thus having little possibility of recovery; a step size that is too small lowers the overall efficiency. Consequently, PPO implements a ''clip function'' that constrains the policy update of an agent from being too large, so that larger step sizes may be used without negatively affecting the [[Gradient descent|gradient ascent]] process.<ref name=":5" />


== Development ==


In 2015, John Schulman introduced Trust Region Policy Optimization (TRPO), a predecessor of PPO. TRPO addressed the instability issue of another algorithm, the [[Q-learning|Deep Q-Network]] (DQN), by using a trust region constraint to regulate the [[Kullback–Leibler divergence|KL divergence]] between the old and new policies. However, TRPO is [[Computational complexity theory|computationally complex]] and inefficient due to its use of second-order optimization, which makes it difficult and expensive to apply to large-scale problems.<ref name=":4">Wang, Y., He, H., Wen, C., & Tan, X. (2019). Truly Proximal Policy Optimization. ''ArXiv''. /abs/1903.07940</ref><ref name=":3">Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ''ArXiv''. /abs/1502.05477</ref>


In 2017, John Schulman solved the complexity issue of TRPO by adopting first-order optimization in PPO. Schulman and his team designed a clipping mechanism that stops the new policy from deviating significantly from the old one whenever the likelihood ratio between them falls outside a chosen clipping range.<ref name=":0" /><ref name=":3" /> In other words, PPO modifies the objective function of TRPO by penalizing policy updates that are too large. PPO also dispenses with the complicated trust region constraint and uses the clipping function instead. As a result, PPO retains the framework of TRPO while improving performance and simplifying implementation.


== Theory ==
This section first explores the key components of PPO, then examines its main objective function in detail.


=== Basic concepts ===
To begin the PPO training process, the agent is placed in an environment and performs actions based on its current input. In the early phase of training, the agent can freely explore solutions and keep track of the results. Later, after a certain number of transition samples and policy updates, the agent selects an action by randomly sampling from the [[probability distribution]] <math>P(A|S)</math> generated by the policy network.<ref>"A Beginner's Guide to deep Reinforcement learning," ''Pathmind''. https://wiki.pathmind.com/deep-reinforcement-learning#reward </ref> The actions that are most likely to be beneficial have the highest probability of being selected from the random sample. After the agent arrives at a different scenario (a new [[Reinforcement learning|state]]) by acting, it receives either a positive or a negative reward. The objective of the agent is to maximize the cumulative reward signal across sequences of states, known as ''episodes''.
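The following is a minimal sketch, not drawn from the cited sources, of how an agent might sample an action from the distribution <math>P(A|S)</math> produced by a policy network; the network architecture, state size, and action count are illustrative assumptions.

<syntaxhighlight lang="python">
# Minimal sketch: sampling an action from the policy's distribution P(A|S).
# The architecture, state size (4), and action count (2) are illustrative assumptions.
import torch
import torch.nn as nn

policy_net = nn.Sequential(      # hypothetical small policy network
    nn.Linear(4, 64), nn.Tanh(),
    nn.Linear(64, 2),            # logits for two discrete actions
)

state = torch.rand(4)                                             # placeholder observation
dist = torch.distributions.Categorical(logits=policy_net(state))  # P(A|S)
action = dist.sample()                            # likelier actions are sampled more often
log_prob = dist.log_prob(action)                  # stored for the later ratio computation
</syntaxhighlight>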


=== Policy gradient laws: the advantage function ===
The ''advantage function'' (denoted as <math>A</math>) is central to PPO, as it tries to answer the question of whether a specific action of the agent is better or worse than some other possible action in a given state. By definition, the advantage function is an estimate of the relative value of a selected action. If the output of this function is positive, the chosen action is better than the average return, so the probability of selecting that specific action will increase; the opposite holds for a negative advantage output.<ref name=":3" />


The advantage function can be defined as <math>A=Q-V</math>, where <math>Q</math> is the discounted sum of rewards (the total weighted reward for the completion of an episode) and <math>V</math> is the baseline estimate.<ref>OpenAI, "Part 1: Key concepts in RL," Part 1: Key Concepts in RL - Spinning Up documentation, https://spinningup.openai.com/en/latest/spinningup/rl_intro.html</ref><ref name=":3" /> Since the advantage function is calculated after the completion of an episode, the program records the outcome of the episode. Therefore, calculating advantage is essentially an [[unsupervised learning]] problem. The baseline estimate comes from the value function that outputs the ''expected'' discounted sum of an episode starting from the current state. In the PPO algorithm, the baseline estimate will be noisy (with some [[variance]]), as it also uses a [[Neural network (machine learning)|neural network]], like the policy function itself. With <math>Q</math> and <math>V</math> computed, the advantage function is calculated by subtracting the baseline estimate from the actual discounted return.<ref>Rohitkumar, "PPO (Proximal Policy Optimization) explained with code examples in PyTorch and TensorFlow," PlainSwipe, https://plainswipe.com/ppo-proximal-policy-optimization-explained-with-code-examples-in-pytorch-and-tensorflow/</ref> If <math>A>0</math>, the actual return of the action is better than the expected return from experience; if <math>A<0</math>, the actual return is worse.
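As an illustration, the advantage for each step of a recorded episode could be computed as in the following sketch; the reward values, baseline estimates, and discount factor are assumptions for demonstration, not values from the cited sources.

<syntaxhighlight lang="python">
# Minimal sketch: compute the discounted return Q for each step of one episode,
# then subtract the baseline estimate V(s_t) to obtain the advantage A = Q - V.
def discounted_returns(rewards, gamma=0.99):
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

rewards = [1.0, 0.0, 2.0]   # rewards recorded during the episode (illustrative)
values = [1.5, 1.2, 1.8]    # baseline estimates V(s_t) from the value network (illustrative)
q = discounted_returns(rewards)
advantages = [q_t - v_t for q_t, v_t in zip(q, values)]  # A > 0 means better than expected
</syntaxhighlight>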


=== Ratio function ===
In PPO, the ''ratio function'' (<math>r_t</math>) calculates the probability of selecting action <math>a</math> in state <math>s</math> given the current policy network, divided by the previous probability under the old policy. In other words:


* If <math>r_t(\theta) > 1</math>, where <math>\theta</math> are the policy network parameters, then selecting action <math>a</math> in state <math>s</math> is more likely based on the current policy than the previous policy.
* If <math>0 \le r_t(\theta) < 1</math>, then selecting action <math>a</math> in state <math>s</math> is less likely based on the current policy than the old policy.


Hence, this ratio function can easily estimate the divergence between old and current policies.<ref>W. Heeswijk, "Proximal Policy Optimization (PPO) explained," Medium, https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b</ref><ref name=":5" />
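In practice, the ratio is usually computed from stored log-probabilities rather than raw probabilities, for numerical stability. A minimal sketch, with illustrative values:

<syntaxhighlight lang="python">
# Minimal sketch: the probability ratio r_t(theta) from stored log-probabilities.
import torch

log_prob_old = torch.tensor([-1.20])  # log-probability of a_t under the old policy (rollout)
log_prob_new = torch.tensor([-0.90])  # log-probability of a_t under the current policy
ratio = torch.exp(log_prob_new - log_prob_old)  # about 1.35 > 1: the action became more likely
</syntaxhighlight>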


=== PPO objective function ===
The [[Objective function (artificial intelligence)|objective function]] of PPO takes the [[expectation operator]] (denoted as <math>E</math>), which means that it is computed as an average over batches of sampled trajectories. The expectation is taken of the minimum of two terms:


# <math>r_t(\theta) \cdot A</math>: this is the product of the ratio function and the advantage function introduced in TRPO, also known as the normal policy gradient objective.<ref>Edan Meyer. "Proximal Policy Optimization Explained," ''YouTube'', May 20th, 2021 [Video file]. Available: https://www.youtube.com/watch?v=HrapVFNBN64</ref>
# <math>\operatorname{clip}(r_t(\theta)) \cdot A</math>: the policy ratio is first clipped to the range <math>[1 - \epsilon, 1 + \epsilon]</math>; generally, <math>\epsilon</math> is defined to be 0.2. Then, as before, it is multiplied by the advantage.
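Combining the two terms, the clipped surrogate objective of PPO<ref name=":0" /> can be written as
<math display="block">L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right],</math>
where <math>A_t</math> is the advantage estimate at timestep <math>t</math>.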


The fundamental intuition behind PPO is the same as that of TRPO: conservatism. Clipping results in a conservative advantage estimate of the new policy. The reasoning is that if an agent makes significant changes due to high advantage estimates, its policy update will be large and unstable, and may diverge from the optimal policy with little possibility of recovery.<ref>{{cite web|author1=C. M. Wild|title=The Pursuit of (Robotic) Happiness: How TRPO and PPO Stabilize Policy Gradient Methods|url=https://towardsdatascience.com/the-pursuit-of-robotic-happiness-how-trpoand-ppo-stabilize-policy-gradient-methods-545784094e3b|website=towardsdatascience.com|date=Jul 9, 2018}}</ref> There are two common applications of the clipping function: when an action under a new policy happens to be a good choice based on the advantage function, the clipping function limits how much credit can be given to a new policy for up-weighted good actions. On the other hand, when an action under the old policy is judged to be bad, the clipping function constrains how much the agent can accept the down-weighted bad actions of the new policy.<ref name=":6">Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., Wang, B.,(2023). Secrets of RLHF in Large Language Models Part I: PPO. ''ArXiv''. /abs/2307.04964</ref> Consequently, the clipping mechanism is designed to discourage the incentive of moving beyond the defined range by clipping both directions. The advantage of this method is that it can be optimized directly with [[gradient descent]], as opposed to the strict KL divergence constraint of TRPO, making the implementation faster and more intuitive.


After computing the clipped surrogate objective function, the agent has two probability ratios: one non-clipped and one clipped. Then, by taking the minimum of the two objectives, the final objective becomes a lower bound (a ''pessimistic'' bound) of what the agent knows is possible.<ref name=":6" /> In other words, the minimum method makes sure that the agent is doing the safest possible update.
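The following is a minimal sketch of this computation, assuming per-step tensors of log-probabilities and advantages; the function name and signature are illustrative rather than a reference implementation.

<syntaxhighlight lang="python">
# Minimal sketch: the clipped surrogate objective, negated so that minimizing this
# loss with gradient descent performs the "safest possible" policy-ascent step.
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)                           # r_t(theta)
    unclipped = ratio * advantages                                           # first term
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages  # second term
    return -torch.min(unclipped, clipped).mean()                             # pessimistic bound
</syntaxhighlight>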


== Advantages ==


=== Simplicity ===
PPO approximates what TRPO does, with considerably less computation. It uses first-order optimization (the clip function) to constrain the policy update, while TRPO uses a KL divergence constraint (second-order optimization). Compared to TRPO, PPO is relatively easy to implement and requires less computation time. Therefore, it is cheaper and more efficient to apply PPO to large-scale problems.<ref>J. Nocedal and Y. Nesterov., "Natural, trust region and proximal policy optimization," TransferLab, https://transferlab.ai/blog/trpo-and-ppo/</ref>


=== Stability ===
While other RL algorithms require [[Hyperparameter (machine learning)|hyperparameter]] tuning, PPO comparatively requires little of it (an epsilon of 0.2 works well in most cases).<ref>J. Hui, "RL - reinforcement learning algorithms comparison," Medium, https://jonathan-hui.medium.com/rl-reinforcement-learning-algorithms-comparison-76df90f180cf/</ref> Also, PPO does not require sophisticated optimization techniques. It can be easily implemented with standard [[deep learning]] frameworks and generalizes to a broad range of tasks.


=== Sample efficiency ===
Sample efficiency indicates how much data an algorithm needs to train a good policy. On-policy algorithms, including PPO and TRPO, are generally less sample efficient than off-policy alternatives, because they cannot freely reuse data collected under earlier policies.<ref>Huang, Shengyi, and Dossa, "The 37 implementation details of proximal policy optimization," The 37 Implementation Details of Proximal Policy Optimization · The ICLR Blog Track, https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/</ref> However, PPO recovers some sample efficiency through its surrogate objective: the clip function keeps the new policy from moving too far from the old policy, [[Regularization (mathematics)|regularizing]] the update and allowing training data to be reused. Sample efficiency is especially useful for complicated and high-dimensional tasks, where data collection and computation can be costly.<ref>XiaoYang-ElegantRL, "ElegantRL: Mastering PPO Algorithms - towards Data Science," ''Medium'', Nov. 23, 2022. [Online]. Available: https://towardsdatascience.com/elegantrl-mastering-the-ppo-algorithm-part-i-9f36bc47b791 </ref>


== See also ==

== References ==
{{Reflist}}