A PPO Agent implementing the clipped probability ratios.
Please see details of the algorithm in Schulman et al. (2017): https://arxiv.org/abs/1707.06347.
Disclaimer: We intend for this class to eventually fully replicate: https://github.com/openai/baselines/tree/master/baselines/ppo2
Currently, this agent surpasses the paper's reported average returns on HalfCheetah when wider networks and higher learning rates are used. However, some details of this class still differ from the paper's implementation: for example, mini-batch learning and learning rate annealing are not yet performed. Work is in progress to reproduce the paper's implementation exactly.
PPO is a simplification of the TRPO algorithm, both of which add stability to policy gradient RL, while allowing multiple updates per batch of on-policy data.
TRPO enforces a hard optimization constraint, but is a complex algorithm, which often makes it harder to use in practice. PPO approximates the effect of TRPO by using a soft constraint. There are two methods presented in the paper for implementing the soft constraint: an adaptive KL loss penalty, and limiting the objective value based on a clipped version of the policy importance ratio. This agent implements the clipped version.
The importance ratio clipping is described in Eq. (7) of https://arxiv.org/pdf/1707.06347.pdf.
- To disable IR clipping, set the importance_ratio_clipping parameter to 0.0.
Note that the objective function takes the lower of the clipped and unclipped objectives. Thus, even if the importance ratio exceeds the clipped bounds, the optimizer is not incentivized to push it further, since it only optimizes the minimum of the two.
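The clipping and minimum described above can be sketched in a few lines of NumPy. This is an illustration of Eq. (7), not the agent's actual TensorFlow implementation; the function name and the default epsilon of 0.2 are assumptions for the example.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped PPO objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum means the objective never rewards pushing the
    # ratio past the clipping bounds.
    return np.minimum(unclipped, clipped)

# Ratio well above the upper bound with a positive advantage: the
# objective is capped at (1 + 0.2) * 2.0 = 2.4, so the optimizer gains
# nothing by increasing the ratio further.
print(clipped_surrogate(np.array([1.5]), np.array([2.0])))  # [2.4]
```

Setting epsilon to 0.0 here mirrors setting importance_ratio_clipping to 0.0 on the agent in spirit, though in the agent that value disables clipping entirely.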
Advantage is computed using Generalized Advantage Estimation (GAE): https://arxiv.org/abs/1506.02438
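For reference, GAE computes advantages as a discounted sum of TD residuals. The following is a minimal NumPy sketch of that recurrence, not the agent's implementation; the function signature and argument names are assumptions for the example.

```python
import numpy as np

def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    A_t = delta_t + (gamma * lam) * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    values = np.append(values, next_value)
    advantages = np.zeros(len(rewards))
    acc = 0.0
    # Accumulate the discounted TD residuals backwards in time.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        acc = delta + gamma * lam * acc
        advantages[t] = acc
    return advantages
```

With gamma = lam = 1 and a zero value function, each advantage reduces to the undiscounted return-to-go, which is a quick sanity check on the recurrence.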
class PPOClipAgent: A PPO Agent implementing the clipped probability ratios.