


Computes generalized advantage estimation (GAE).

    values, final_value, discounts, rewards, td_lambda=1.0, time_major=True

For the theory, see "High-Dimensional Continuous Control Using Generalized Advantage Estimation" by John Schulman, Philipp Moritz, et al. The full paper is available at https://arxiv.org/abs/1506.02438.
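As a reading aid, the estimator can be written with per-step discounts for a trajectory of T steps. This is a sketch under the assumption that `discounts` plays the role of $\gamma_t$, `values` of $V_t$, `final_value` of $V_T$, and `td_lambda` of $\lambda$; the notation is ours, not the page's.

```latex
\begin{aligned}
\delta_t &= r_t + \gamma_t V_{t+1} - V_t, && V_T := \texttt{final\_value} \\
\hat{A}_t &= \delta_t + \lambda \gamma_t \hat{A}_{t+1}, && \hat{A}_T := 0 \\
&= \sum_{l=0}^{T-t-1} \Big( \prod_{k=0}^{l-1} \lambda \gamma_{t+k} \Big) \delta_{t+l}
\end{aligned}
```

With a constant discount $\gamma_t = \gamma$ the last line is the paper's $(\gamma\lambda)^l$-weighted sum of TD residuals, truncated at the end of the trajectory.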

Define abbreviations:

  • B: batch size, i.e. the number of trajectories
  • T: number of steps per trajectory

Args:
  • values: Tensor with shape [T, B] representing value estimates.
  • final_value: Tensor with shape [B] representing the value estimate at t = T, used to bootstrap past the last step.
  • discounts: Tensor with shape [T, B] representing discounts received by following the behavior policy.
  • rewards: Tensor with shape [T, B] representing rewards received by following the behavior policy.
  • td_lambda: A float32 scalar in [0, 1]. It controls the bias-variance trade-off of the estimate: smaller values reduce variance at the cost of more bias.
  • time_major: A boolean indicating whether the input tensors are time-major. False means values, discounts, and rewards have shape [B, T].
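To make the argument layout concrete, here is a minimal pure-Python sketch of the backward GAE recursion, with nested lists standing in for tensors. It assumes each entry of `discounts` is the full per-step discount applied to the next value. The names `gae_reference` and `_transpose` are hypothetical; this is not the TF-Agents implementation, which operates on TensorFlow tensors.

```python
def _transpose(x):
    """Swap the two leading axes of a nested list: [B, T] <-> [T, B]."""
    return [list(row) for row in zip(*x)]


def gae_reference(values, final_value, discounts, rewards,
                  td_lambda=1.0, time_major=True):
    """Hypothetical list-based GAE sketch mirroring the documented arguments."""
    if not time_major:
        # Batch-major [B, T] inputs are converted to time-major [T, B].
        values, discounts, rewards = (
            _transpose(x) for x in (values, discounts, rewards))
    num_steps = len(values)
    batch_size = len(final_value)  # final_value is [B] in either layout
    advantages = [[0.0] * batch_size for _ in range(num_steps)]
    for b in range(batch_size):
        acc = 0.0                    # advantage at t + 1 (zero past the end)
        next_value = final_value[b]  # bootstrap value at t = T
        for t in reversed(range(num_steps)):
            # One-step TD residual: r_t + gamma_t * V_{t+1} - V_t.
            delta = rewards[t][b] + discounts[t][b] * next_value - values[t][b]
            # A_t = delta_t + lambda * gamma_t * A_{t+1}.
            acc = delta + discounts[t][b] * td_lambda * acc
            advantages[t][b] = acc
            next_value = values[t][b]
    return advantages if time_major else _transpose(advantages)
```

For example, `gae_reference([[1.0], [2.0]], [4.0], [[0.5], [0.5]], [[1.0], [1.0]], td_lambda=0.5)` returns `[[1.25], [1.0]]`. The per-batch Python loop is only for readability; a tensor version would typically run the same backward recursion as a reverse scan over time, vectorized across the batch.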


Returns a tensor of advantages with shape [T, B], or [B, T] when time_major is False.
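Since td_lambda defaults to 1.0, it may help to note that with λ = 1 the recursion telescopes to the bootstrapped discounted return minus the value estimate: A_t = G_t − V_t, where G_t = r_t + γ_t G_{t+1} and G_T = final_value. The hypothetical helper below computes that quantity directly for time-major list inputs, under the same per-step-discount assumption. It is a cross-check sketch, not part of TF-Agents.

```python
def discounted_return_advantages(values, final_value, discounts, rewards):
    """Hypothetical helper: G_t - V_t for time-major [T, B] lists, G_T = final_value."""
    num_steps = len(values)
    batch_size = len(final_value)
    advantages = [[0.0] * batch_size for _ in range(num_steps)]
    for b in range(batch_size):
        ret = final_value[b]  # bootstrap return beyond the last step
        for t in reversed(range(num_steps)):
            # Discounted return accumulated backward in time.
            ret = rewards[t][b] + discounts[t][b] * ret
            advantages[t][b] = ret - values[t][b]
    return advantages
```

For example, `discounted_return_advantages([[1.0], [2.0]], [4.0], [[0.5], [0.5]], [[1.0], [1.0]])` returns `[[1.5], [1.0]]`, which matches running the GAE recursion on the same inputs with td_lambda = 1.0.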