For details, see
"Reinforcement Learning: An Introduction", Second Edition,
by Richard S. Sutton and Andrew G. Barto.
B: batch size representing number of trajectories.
T: number of steps per trajectory. This is equal to N - n in the equation
Tensor with shape [T, B] (or [T]) representing rewards.
Tensor with shape [T, B] (or [T]) representing discounts.
Optional; defaults to an all-zeros tensor. Tensor with shape
[B] (or the unbatched equivalent) representing the value estimate at step T.
When set, the final value is used to bootstrap the return computation.
time_major: A boolean indicating whether the input tensors are time major.
If False, the input tensors have shape [B, T].
provide_all_returns: A boolean. If True, the discounted return is provided
for every step along the time dimension; if False, only the single
complete discounted return is provided.
If provide_all_returns:
A tensor with shape [T, B] (or [T]) representing the discounted
returns. The shape is [B, T] when not time_major.
If not provide_all_returns:
A tensor with shape [B] (or the unbatched equivalent) representing the
single complete discounted return.
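The behaviour described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation. It assumes the return follows the backward recursion G_t = r_t + discount_t * G_{t+1}, seeded with the final value (or zeros), and that the single complete return is the return at the first step. The function name `discounted_return_sketch` and the argument names `rewards`, `discounts`, and `final_value` are chosen here for illustration; only `time_major` and `provide_all_returns` appear in the text above. In the unbatched case the sketch expects a scalar final value.

```python
import numpy as np


def discounted_return_sketch(rewards, discounts, final_value=None,
                             time_major=True, provide_all_returns=True):
    """Illustrative sketch only; names and recursion are assumptions."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = np.asarray(discounts, dtype=np.float64)
    if not time_major and rewards.ndim == 2:
        # Convert [B, T] inputs to [T, B] so time is the leading axis.
        rewards, discounts = rewards.T, discounts.T
    if final_value is None:
        # Assumed default: bootstrap from an all-zeros value estimate.
        acc = np.zeros(rewards.shape[1:])
    else:
        acc = np.asarray(final_value, dtype=np.float64)
    returns = np.zeros_like(rewards)
    # Backward recursion: G_t = r_t + discount_t * G_{t+1}.
    for t in reversed(range(rewards.shape[0])):
        acc = rewards[t] + discounts[t] * acc
        returns[t] = acc
    if not provide_all_returns:
        # Assumption: the complete return is the return at the first step.
        return returns[0]
    if not time_major and returns.ndim == 2:
        # Restore the caller's [B, T] layout.
        returns = returns.T
    return returns


rewards = [1.0, 1.0, 1.0]
discounts = [0.5, 0.5, 0.5]

# Without a final value the returns are 1.75, 1.5, 1.0.
all_returns = discounted_return_sketch(rewards, discounts)

# Bootstrapping from a final value of 4.0 gives 2.25, 2.5, 3.0.
bootstrapped = discounted_return_sketch(rewards, discounts, final_value=4.0)
```

With a final value of 4.0 the last step becomes 1 + 0.5 * 4 = 3 instead of 1, and that difference propagates backwards through every earlier return, which is the bootstrapping effect the final value description refers to.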