tf_agents.trajectories.to_n_step_transition

Create an n-step transition from a trajectory with T = N + 1 frames.

The output transition's next_time_step.{reward, discount} will contain the N-step discounted reward and discount, calculated as:

next_time_step.reward = r_t +
                        g^{1} * d_t * r_{t+1} +
                        g^{2} * d_t * d_{t+1} * r_{t+2} +
                        g^{3} * d_t * d_{t+1} * d_{t+2} * r_{t+3} +
                        ...
                        g^{N-1} * d_t * ... * d_{t+N-2} * r_{t+N-1}
next_time_step.discount = g^{N-1} * d_t * d_{t+1} * ... * d_{t+N-1}

In Python notation:

# Reduce over the time axis so the batch dimension is preserved.
discount = gamma**(N-1) * reduce_prod(trajectory.discount[:, :-1], axis=1)
reward = discounted_return(
    rewards=trajectory.reward[:, :-1],
    discounts=gamma * trajectory.discount[:, :-1])
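
The notation above can be written out as plain, runnable math. The following NumPy sketch computes the same N-step reward and discount directly from the formulas; it is an illustration of the arithmetic, not the library's implementation, and the helper name n_step_reward_and_discount is hypothetical:

import numpy as np

def n_step_reward_and_discount(rewards, discounts, gamma):
  # rewards, discounts: arrays of shape [B, T], with T = N + 1 frames.
  r = rewards[:, :-1]    # r_t ... r_{t+N-1}, shape [B, N]
  d = discounts[:, :-1]  # d_t ... d_{t+N-1}, shape [B, N]
  b, n = r.shape
  # Weight on r_{t+k} is gamma^k * d_t * ... * d_{t+k-1} (and 1 for k = 0).
  weights = np.concatenate(
      [np.ones((b, 1)), np.cumprod(gamma * d, axis=1)[:, :-1]], axis=1)
  reward = np.sum(weights * r, axis=1)                # N-step discounted reward
  discount = gamma ** (n - 1) * np.prod(d, axis=1)    # N-step discount
  return reward, discount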

When trajectory.discount[:, :-1] is an all-ones tensor, this is equivalent to:

next_time_step.discount = (
    gamma**(N-1) * tf.ones_like(trajectory.discount[:, 0]))
next_time_step.reward = (
    sum_{n=0}^{N-1} gamma**n * trajectory.reward[:, n])
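
For example, with gamma = 0.9 and N = 3 all-ones steps, this special case gives reward 1 + 0.9 + 0.81 = 2.71 and discount 0.9**2 = 0.81. A quick check of that arithmetic:

import numpy as np

gamma, B, N = 0.9, 2, 3
rewards = np.ones((B, N + 1))     # [B, T] with T = N + 1
discounts = np.ones((B, N + 1))

reward = sum(gamma**n * rewards[:, n] for n in range(N))   # [2.71, 2.71]
discount = gamma**(N - 1) * np.ones_like(discounts[:, 0])  # [0.81, 0.81]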

Args:
  trajectory: An instance of Trajectory. The tensors in Trajectory must have shape [B, T, ...]. discount is assumed to be a scalar float, hence the shape of trajectory.discount must be [B, T].
  gamma: A floating point scalar; the discount factor.

Returns:
  An N-step Transition where N = T - 1. The reward and discount in time_step.{reward, discount} are NaN. The n-step discounted reward and final discount are stored in next_time_step.{reward, discount}. All tensors in the Transition have shape [B, ...] (no time dimension).

Raises:
  ValueError: if discount.shape.rank != 2.
  ValueError: if discount.shape[1] < 2.
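
A minimal end-to-end usage sketch follows. The field values below (dummy observations, all-MID step types, unit rewards and discounts) are illustrative assumptions, not requirements stated on this page:

import tensorflow as tf
from tf_agents import trajectories
from tf_agents.trajectories import time_step as ts

B, T = 2, 4  # batch size 2; T = N + 1 = 4 frames, so N = 3
traj = trajectories.Trajectory(
    step_type=tf.fill([B, T], ts.StepType.MID),
    observation=tf.zeros([B, T, 5]),
    action=tf.zeros([B, T], dtype=tf.int32),
    policy_info=(),
    next_step_type=tf.fill([B, T], ts.StepType.MID),
    reward=tf.ones([B, T]),
    discount=tf.ones([B, T]))

transition = trajectories.to_n_step_transition(traj, gamma=0.9)
# Per the formulas above, with all-ones rewards and discounts:
# transition.next_time_step.reward   -> [2.71, 2.71]
# transition.next_time_step.discount -> [0.81, 0.81]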