An instance of Trajectory. The tensors in Trajectory must have
shape [B, T, ...]. discount is assumed to be a scalar float,
hence the shape of trajectory.discount must be [B, T].
gamma
A floating point scalar; the discount factor.
Returns
An N-step Transition where N = T - 1. The reward and discount in
time_step.{reward, discount} are NaN. The n-step discounted reward
and final discount are stored in next_time_step.{reward, discount}.
All tensors in the Transition have shape [B, ...] (no time dimension).