


Computes discounted return.

    rewards, discounts, final_value=None, time_major=True, provide_all_returns=True

Q_t = sum_{t'=t}^T gamma^(t'-t) * r_{t'} + gamma^(T-t+1) * final_value,

where r_{t'} is the reward at step t' and gamma is the discount.
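The formula above can be evaluated with a backward recursion, Q_t = r_t + gamma * Q_{t+1}, starting from Q_{T+1} = final_value. The following is a minimal NumPy sketch of that recursion, not the library's implementation. The name discounted_return_sketch is hypothetical, and applying the discount per step (discounts[t]) is an assumption based on the discounts argument having the same shape as rewards.

```python
import numpy as np


def discounted_return_sketch(rewards, discounts, final_value=None):
    """Hypothetical reference for time-major inputs of shape [T, B]."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = np.asarray(discounts, dtype=np.float64)
    num_steps, batch_size = rewards.shape

    # Without final_value the bootstrap term is zero.
    if final_value is None:
        next_return = np.zeros(batch_size)
    else:
        next_return = np.asarray(final_value, dtype=np.float64)

    returns = np.zeros_like(rewards)
    # Walk backwards in time: Q_t = r_t + discount_t * Q_{t+1}.
    for t in reversed(range(num_steps)):
        next_return = rewards[t] + discounts[t] * next_return
        returns[t] = next_return
    return returns


rewards = np.array([[1.0], [2.0], [3.0]])   # T = 3, B = 1
discounts = np.full((3, 1), 0.5)            # constant gamma = 0.5
returns = discounted_return_sketch(rewards, discounts, final_value=[4.0])
print(returns[:, 0])
```

With a constant discount of 0.5 and final_value of 4, the first entry equals 1 + 0.5 * 2 + 0.25 * 3 + 0.125 * 4, which matches the closed-form sum.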

For details, see "Reinforcement Learning: An Introduction", Second Edition, by Richard S. Sutton and Andrew G. Barto.

Define abbreviations:

  • (B): batch size, representing the number of trajectories.
  • (T): number of steps per trajectory.


  • rewards: Tensor with shape [T, B] representing rewards.
  • discounts: Tensor with shape [T, B] representing discounts.
  • final_value: Tensor with shape [B] representing the value estimate at t=T. This is optional. When set, the final value is used to bootstrap the reward-to-go computation; otherwise it is treated as zero.
  • time_major: A boolean indicating whether input tensors are time major. False means input tensors have shape [B, T].
  • provide_all_returns: A boolean; if True, this will provide all of the returns by time dimension; if False, this will only give the single complete discounted return.


  • If provide_all_returns is True: A tensor with shape [T, B] representing the discounted returns. Shape is [B, T] when time_major is False.
  • If provide_all_returns is False: A tensor with shape [B] representing the discounted returns.
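The output shapes described above can be illustrated with a small NumPy sketch. This is not the library's code: returns_with_layout is a hypothetical helper, final_value is left out (so the bootstrap term is zero), and the "single complete discounted return" is assumed to be the return from the first time step.

```python
import numpy as np


def returns_with_layout(rewards, discounts, time_major=True,
                        provide_all_returns=True):
    """Hypothetical helper showing the documented output shapes."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = np.asarray(discounts, dtype=np.float64)
    if not time_major:
        # Convert [B, T] inputs to [T, B] for the backward pass.
        rewards, discounts = rewards.T, discounts.T

    returns = np.zeros_like(rewards)
    next_return = np.zeros(rewards.shape[1])  # no final_value, so zero
    for t in reversed(range(rewards.shape[0])):
        next_return = rewards[t] + discounts[t] * next_return
        returns[t] = next_return

    if not provide_all_returns:
        # Assumption: the complete return is the one from the first step.
        return returns[0]                      # shape [B]
    # Restore the caller's layout: [T, B] or [B, T].
    return returns if time_major else returns.T


batch_major_rewards = np.ones((2, 4))          # B = 2, T = 4
batch_major_discounts = np.full((2, 4), 0.9)

all_returns = returns_with_layout(
    batch_major_rewards, batch_major_discounts, time_major=False)
single_return = returns_with_layout(
    batch_major_rewards, batch_major_discounts, time_major=False,
    provide_all_returns=False)

print(all_returns.shape)     # (2, 4)
print(single_return.shape)   # (2,)
```

In this sketch the single return for each trajectory equals the first column of the batch-major result, which is one way to read the relationship between the two modes.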