For theory, see "High-Dimensional Continuous Control Using Generalized Advantage Estimation" by John Schulman, Philipp Moritz et al. The full paper is available at https://arxiv.org/abs/1506.02438.

(B) batch size, representing the number of trajectories
(T) number of steps per trajectory

`values` Tensor with shape `[T, B]` representing value estimates.
`final_value` Tensor with shape `[B]` representing value estimate at t=T.
`discounts` Tensor with shape `[T, B]` representing discounts received by following the behavior policy.
`rewards` Tensor with shape `[T, B]` representing rewards received by following the behavior policy.
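`td_lambda` A float32 scalar in the range [0, 1], used for variance reduction in temporal-difference estimation. In the cited paper, 0 reduces the estimate to the one-step TD error (lowest variance, most bias) and 1 to the discounted return minus the value estimate (highest variance, least bias).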
`time_major` A boolean indicating whether input tensors are time major. False means input tensors have shape `[B, T]`.

A tensor with shape `[T, B]` representing advantages. The shape is `[B, T]` when `time_major` is False.
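
To make the relationship between the arguments concrete, the following is a minimal pure-Python sketch of the GAE recursion from the cited paper, not the library's implementation. It assumes the per-step `discounts` play the role of the paper's discount factor, that inputs are time major, and that nested lists stand in for tensors. The name `gae_sketch` is hypothetical and used only for illustration.

```python
def gae_sketch(values, final_value, discounts, rewards, td_lambda=1.0):
    """Illustrative GAE over nested lists shaped [T][B]; final_value is [B].

    Assumes time-major inputs. This is a sketch of the standard recursion
    A_t = delta_t + discount_t * td_lambda * A_{t+1}, with A_T = 0.
    """
    num_steps = len(values)
    batch_size = len(final_value)
    advantages = [[0.0] * batch_size for _ in range(num_steps)]
    for b in range(batch_size):
        next_value = final_value[b]  # value estimate at t=T
        next_advantage = 0.0         # nothing is accumulated past the last step
        for t in reversed(range(num_steps)):
            # One-step TD error: r_t + discount_t * V(t+1) - V(t).
            delta = rewards[t][b] + discounts[t][b] * next_value - values[t][b]
            # Fold in the lambda-weighted advantage of the following step.
            next_advantage = delta + discounts[t][b] * td_lambda * next_advantage
            advantages[t][b] = next_advantage
            next_value = values[t][b]
    return advantages


# Usage: T=2 steps, B=1 trajectory.
example = gae_sketch(
    values=[[1.0], [2.0]],
    final_value=[3.0],
    discounts=[[0.5], [0.5]],
    rewards=[[1.0], [1.0]],
    td_lambda=0.5,
)
print(example)  # [[1.125], [0.5]]
```

With `td_lambda=0` the same inputs give the one-step TD errors `[[1.0], [0.5]]`, and with `td_lambda=1` the first entry becomes `1.25`, which equals the discounted return from step 0 (`2.25`) minus `values[0]`.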
