A PPO Agent implementing the KL penalty loss.
Please see (Schulman et al., 2017) for details of the algorithm: https://arxiv.org/abs/1707.06347
Disclaimer: We intend for this class to eventually fully replicate the KL penalty version of PPO from https://github.com/openai/baselines/tree/master/baselines/ppo1. We are still working on resolving differences in implementation details, such as minibatch learning and learning rate annealing.
PPO is a simplification of the TRPO algorithm, both of which add stability to policy gradient RL, while allowing multiple updates per batch of on-policy data.
TRPO enforces a hard optimization constraint, but is a complex algorithm, which often makes it harder to use in practice. PPO approximates the effect of TRPO by using a soft constraint. There are two methods presented in the paper for implementing the soft constraint: an adaptive KL loss penalty, and limiting the objective value based on a clipped version of the policy importance ratio. This agent implements the KL penalty version.
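The adaptive KL penalty described above can be sketched in plain NumPy. This is a minimal illustration of the rule from Schulman et al. (2017), Sec. 4, not the TF-Agents implementation; the function names and the tolerance factor of 1.5 are taken from the paper, and the doubling/halving of the penalty coefficient is the heuristic the authors propose:

```python
import numpy as np

def kl_penalty_loss(ratio, advantage, kl, beta):
    # Surrogate objective with a soft KL penalty, negated so that
    # minimizing the loss maximizes the penalized objective.
    return float(-np.mean(ratio * advantage - beta * kl))

def update_beta(beta, mean_kl, kl_target, tol=1.5):
    # Adaptive coefficient update: tighten the penalty when the policy
    # moved too far (KL above target), relax it when it moved too little.
    if mean_kl > kl_target * tol:
        return beta * 2.0
    if mean_kl < kl_target / tol:
        return beta / 2.0
    return beta
```

After each round of policy updates, `beta` is adjusted based on the measured mean KL divergence, so the soft penalty approximately tracks TRPO's hard constraint over training.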
Note that PPOKLPenaltyAgent is known to perform worse than PPOClipAgent (Schulman et al., 2017). We include the implementation because it is an important baseline.
Note that PPOKLPenaltyAgent's behavior can be reproduced by the parent PPOAgent if the right set of parameters is used. However, if you rely on the KL penalty version of PPO, we strongly encourage you to use PPOKLPenaltyAgent instead: it abstracts away the parameters unrelated to this particular PPO variant, making it less error prone.
Advantage is computed using Generalized Advantage Estimation (GAE): https://arxiv.org/abs/1506.02438
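The GAE recursion can be sketched in plain NumPy. This is a minimal illustration, not the TF-Agents implementation; the convention that `values` carries one extra bootstrap entry is an assumption of this sketch:

```python
import numpy as np

def gae_advantages(rewards, values, discount=0.99, lam=0.95):
    # rewards: length-T array; values: length-(T+1) array, where the
    # final entry is the bootstrap value estimate for the last state.
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Accumulate the exponentially weighted sum of TD residuals backward.
    for t in reversed(range(T)):
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * lam * gae
        advantages[t] = gae
    return advantages
```

With `lam=0` this reduces to one-step TD residuals; with `lam=1` it recovers the full Monte Carlo advantage, so `lam` trades off bias against variance.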
class PPOKLPenaltyAgent: A PPO Agent implementing the KL penalty loss.