
Computes discounted return.

Q_t = sum_{t'=t}^T gamma^(t'-t) * r_{t'} + gamma^(T-t+1)*final_value.

For details, see "Reinforcement Learning: An Introduction", Second Edition, by Richard S. Sutton and Andrew G. Barto.
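
As a rough illustration of the formula above, the following pure-Python sketch computes Q_t for a single trajectory with a backward recursion, Q_t = r_t + discount_t * Q_{t+1}, seeded with final_value. The helper name `discounted_returns_1d` is hypothetical and not part of any library. It assumes per-step discounts, which reduce to the formula when every discount equals gamma.

```python
def discounted_returns_1d(rewards, discounts, final_value=0.0):
    """Hypothetical helper: returns [Q_0, ..., Q_T] for a single trajectory."""
    returns = [0.0] * len(rewards)
    running = final_value  # value bootstrapped after the last step
    # Walk backwards so each Q_t reuses the already computed Q_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discounts[t] * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]
discounts = [0.9, 0.9, 0.9]  # constant gamma = 0.9

# Without bootstrapping: roughly [2.71, 1.9, 1.0].
print(discounted_returns_1d(rewards, discounts))
# With final_value = 10.0 every return equals 10.0 (1 + 0.9 * 10 = 10).
print(discounted_returns_1d(rewards, discounts, final_value=10.0))
```

The backward pass costs one multiply-add per step, instead of re-summing the tail of the trajectory for every t.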

Define abbreviations:

(B) batch size representing number of trajectories
(T) number of steps per trajectory

rewards: Tensor with shape [T, B] representing rewards.
discounts: Tensor with shape [T, B] representing discounts.
final_value: Optional tensor with shape [B] representing the value estimate at t=T. When set, it bootstraps the reward-to-go computation; otherwise it is treated as zero.
time_major: A boolean indicating whether input tensors are time major. False means input tensors have shape [B, T].
provide_all_returns: A boolean. If True, the discounted return at every time step is returned; if False, only the single complete discounted return is returned.

If provide_all_returns is True: A tensor with shape [T, B] representing the discounted returns. Shape is [B, T] when time_major is False.
If provide_all_returns is False: A tensor with shape [B] representing the discounted returns.
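
To make the shape conventions concrete, here is a minimal pure-Python sketch on nested lists. It is not the library's implementation: the name `discounted_return_sketch` is hypothetical, and it assumes that the single complete return is the one computed from the first time step.

```python
def discounted_return_sketch(rewards, discounts, final_value=None,
                             time_major=True, provide_all_returns=True):
    """Hypothetical stand-in that mirrors the documented shapes."""
    if not time_major:
        # Convert [B, T] input to [T, B] so time is the outer dimension.
        rewards = [list(col) for col in zip(*rewards)]
        discounts = [list(col) for col in zip(*discounts)]
    num_steps, batch = len(rewards), len(rewards[0])
    # A missing final_value is treated as zero for every trajectory.
    running = list(final_value) if final_value is not None else [0.0] * batch
    returns = [None] * num_steps
    for t in reversed(range(num_steps)):
        running = [rewards[t][b] + discounts[t][b] * running[b]
                   for b in range(batch)]
        returns[t] = running
    if not provide_all_returns:
        # Assumption: the single return is the one from the first step.
        return returns[0]  # shape [B]
    if not time_major:
        returns = [list(row) for row in zip(*returns)]  # back to [B, T]
    return returns


# Two trajectories (B = 2) of three steps (T = 3), in [B, T] layout.
rewards = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
discounts = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

all_returns = discounted_return_sketch(rewards, discounts, time_major=False)
single = discounted_return_sketch(rewards, discounts, time_major=False,
                                  provide_all_returns=False)
print(all_returns)  # [[1.75, 1.5, 1.0], [0.5, 1.0, 2.0]], shape [B, T]
print(single)       # [1.75, 0.5], shape [B]
```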