A Network which returns the value prediction for input
states, with call(observation, step_type, network_state). Commonly, it
is set to value_network.ValueNetwork.
Number of epochs for computing policy updates. (Schulman, 2017)
sets this to 10 for Mujoco, 15 for Roboschool and 3 for Atari.
Initial value for beta coefficient of adaptive
KL penalty. This initial value is not important in practice because the
algorithm quickly adjusts to it. A common default is 1.0.
Desired KL target for policy updates. If actual KL is
far from this target, adaptive_kl_beta will be updated. You should tune
this for your environment. 0.01 was found to perform well for Mujoco.
A tolerance for adaptive_kl_beta. Mean KL above
(1 + tol) * adaptive_kl_target, or below
(1 - tol) * adaptive_kl_target,
will cause adaptive_kl_beta to be updated. 0.5 was chosen
heuristically in the paper, but the algorithm is not very
sensitive to it.
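For intuition, here is a minimal sketch of the adaptive update described above. The function name is illustrative, and the factor-of-2 update follows the heuristic in (Schulman, 2017); the agent's actual implementation may differ in detail.

    def update_adaptive_kl_beta(beta, mean_kl, kl_target, tol):
        # Increase the penalty when the policy moved well past the target KL,
        # and relax it when the policy barely moved at all.
        if mean_kl > (1.0 + tol) * kl_target:
            beta *= 2.0
        elif mean_kl < (1.0 - tol) * kl_target:
            beta /= 2.0
        return beta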
If True, uses generalized advantage estimation for computing
per-timestep advantage. Else, just subtracts value predictions from
empirical return.
If True, uses td_lambda_return for training
value function; here:
td_lambda_return = gae_advantage + value_predictions.
use_gae must be set to True as well to enable TD-lambda returns. If
use_td_lambda_return is set to True while use_gae is False, the
empirical return will be used and a warning will be logged.
Lambda parameter for TD-lambda computation. Defaults to
0.95, which is the value used for all environments in the paper.
Discount factor for return computation. Defaults to 0.99,
which is the value used for all environments in the paper.
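To make the two formulas above concrete, here is a minimal NumPy sketch of GAE advantages and TD-lambda returns. The helper is hypothetical (it is not the agent's internal value-computation code), and discounts are assumed to be per-step gamma values.

    import numpy as np

    def gae_and_td_lambda_returns(rewards, values, next_values, discounts,
                                  lambda_value=0.95):
        # Backward recursion for generalized advantage estimation:
        #   delta_t = r_t + gamma_t * V(s_{t+1}) - V(s_t)
        #   A_t     = delta_t + gamma_t * lambda * A_{t+1}
        # and td_lambda_return_t = A_t + V(s_t), as stated above.
        rewards = np.asarray(rewards, dtype=np.float64)
        values = np.asarray(values, dtype=np.float64)
        next_values = np.asarray(next_values, dtype=np.float64)
        discounts = np.asarray(discounts, dtype=np.float64)
        advantages = np.zeros_like(rewards)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + discounts[t] * next_values[t] - values[t]
            gae = delta + discounts[t] * lambda_value * gae
            advantages[t] = gae
        return advantages, advantages + values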
Multiplier for value prediction loss to balance with
policy gradient loss. Defaults to 0.5, which was used for all
environments in the OpenAI baseline implementation. This parameter is
irrelevant unless you are sharing part of actor_net and value_net. In
that case, you would want to tune this coefficient, whose value depends
on the network architecture of your choice.
Coefficient for entropy regularization loss term.
Defaults to 0.0 because no entropy bonus was applied in the PPO paper.
Coefficient for L2 regularization of unshared actor_net
weights. Defaults to 0.0 because no L2 regularization was applied on
the policy network weights in the PPO paper.
Coefficient for L2 regularization of unshared value
function weights. Defaults to 0.0 because no L2 regularization was
applied on the value network weights in the PPO paper.
Coefficient for L2 regularization of weights shared
between actor_net and value_net. Defaults to 0.0 because no L2
regularization was applied on either network in the PPO paper.
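As a rough illustration of how the coefficients above enter the objective, the sketch below combines the per-term losses into one scalar. The term names are hypothetical, each term is assumed to already be a batch-averaged scalar, and the L2 term is assumed to already include its policy/value/shared coefficients; the agent's actual loss breakdown may differ.

    def total_ppo_loss(policy_gradient_loss,
                       value_estimation_loss,
                       entropy_loss,
                       kl_penalty_loss,
                       l2_regularization_loss,
                       value_pred_loss_coef=0.5,
                       entropy_regularization=0.0):
        # Weight each component by the corresponding coefficient and sum.
        return (policy_gradient_loss
                + value_pred_loss_coef * value_estimation_loss
                + entropy_regularization * entropy_loss
                + kl_penalty_loss
                + l2_regularization_loss)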
If True (default False), keeps moving mean and
variance of observations and normalizes incoming observations.
Additional optimization proposed in (Ilyas et al., 2018).
If True, keeps moving variance of rewards and
normalizes incoming rewards. While not mentioned directly in the PPO
paper, reward normalization was implemented in OpenAI baselines and
(Ilyas et al., 2018) pointed out that it substantially improves performance.
You may refer to Figure 1 of https://arxiv.org/pdf/1811.02553.pdf for a
comparison with and without reward scaling.
Value above and below which to clip the normalized reward.
Additional optimization proposed in (Ilyas et al., 2018), which sets it to
5 or 10.
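A minimal sketch of this kind of reward normalization is below. The class is illustrative only (the agent uses its own streaming normalizer); it keeps a running uncentered second moment, scales rewards by the running standard deviation without subtracting a mean, and clips the result.

    import numpy as np

    class RunningRewardNormalizer:
        # Illustrative streaming normalizer: scale by running std, then clip.

        def __init__(self, reward_norm_clipping=10.0):
            self.var = 1.0
            self.count = 1e-4
            self.clip = reward_norm_clipping

        def __call__(self, rewards):
            rewards = np.asarray(rewards, dtype=np.float64)
            # Update the running (uncentered) second moment of rewards.
            batch_count = rewards.size
            batch_second_moment = np.mean(rewards ** 2)
            total = self.count + batch_count
            self.var = (self.var * self.count
                        + batch_second_moment * batch_count) / total
            self.count = total
            scaled = rewards / np.sqrt(self.var + 1e-8)
            return np.clip(scaled, -self.clip, self.clip)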
+/- value for clipping log probs to prevent inf / NaN
values. Default: no clipping.
Norm length to clip gradients. Default: no clipping.
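For illustration, the two clipping knobs above might be applied as in the sketch below; the helper names are hypothetical, but tf.clip_by_value and tf.clip_by_global_norm are the standard TensorFlow ops for this.

    import tensorflow as tf

    def clip_log_probs(log_probs, log_prob_clipping):
        # Clamp log probabilities into [-c, c] when a clipping value is set.
        if log_prob_clipping > 0.0:
            log_probs = tf.clip_by_value(
                log_probs, -log_prob_clipping, log_prob_clipping)
        return log_probs

    def clip_gradients(grads, gradient_clipping):
        # Rescale the gradient list so its global norm stays within the limit.
        if gradient_clipping is not None:
            grads, _ = tf.clip_by_global_norm(grads, gradient_clipping)
        return grads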
kl_cutoff_coef and kl_cutoff_factor are additional params
if one wants to use a KL cutoff loss term in addition to the adaptive KL
loss term. Defaults to 0.0 to disable the KL cutoff loss term, as this was
not used in the paper. kl_cutoff_coef is the coefficient to multiply by
the KL cutoff loss term, before adding to the total loss function.
Only meaningful when kl_cutoff_coef > 0.0. A multiplier
used for calculating the KL cutoff
(= kl_cutoff_factor * adaptive_kl_target). If the policy KL averaged
across the batch exceeds the cutoff, a squared cutoff loss is added to
the loss function.
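A minimal sketch of such a squared cutoff term is below; the function name is illustrative and the exact form used in the implementation may differ.

    def kl_cutoff_loss(mean_kl, adaptive_kl_target, kl_cutoff_factor,
                       kl_cutoff_coef):
        # Zero until the batch-averaged KL exceeds the cutoff, then grows
        # quadratically with the overshoot.
        cutoff = kl_cutoff_factor * adaptive_kl_target
        return kl_cutoff_coef * max(0.0, mean_kl - cutoff) ** 2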
A batch of experience data in the form of a Trajectory. The
structure of experience must match that of self.collect_data_spec.
All tensors in experience must be shaped [batch, time, ...] where
time must be equal to self.train_sequence_length if that
property is not None.
(optional). A Tensor, either 0-D or shaped [batch],
containing weights to be used when calculating the total train loss.
Weights are typically multiplied elementwise against the per-batch loss,
but the implementation is up to the Agent.
Any additional data as declared by self.train_argspec.
A LossInfo loss tuple containing loss and info tensors.
In eager mode, the loss values are first calculated, then a train step
is performed before they are returned.
In graph mode, executing any or all of the loss tensors
will first calculate the loss value(s), then perform a train step,
and return the pre-train-step LossInfo.
If experience is not type Trajectory. Or if experience
does not match self.collect_data_spec structure types.
If experience tensors' time axes are not compatible with
self.train_sequence_length. Or if experience does not match
self.collect_data_spec structure.
If the user does not pass **kwargs matching self.train_argspec.
If the class was not initialized properly (super.__init__
was not called).
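For reference, a hedged usage sketch of the call described above; the agent and replay buffer are assumed to have been constructed elsewhere, with replay buffer items matching agent.collect_data_spec.

    # `agent` is an already-constructed PPO agent; `replay_buffer` is a
    # TF-Agents replay buffer holding collected Trajectory data.
    experience = replay_buffer.gather_all()  # Trajectory, [batch, time, ...]
    loss_info = agent.train(experience)
    print('total loss:', loss_info.loss.numpy())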