Module: tf_agents.bandits.agents.ranking_agent

Ranking agent.

This agent trains ranking policies. The policy uses a scoring network to score items, and then selects some of these items based on their scores and similarity. The agent receives feedback based on which item in a recommendation list was interacted with. The agent assumes either a score_vector or a cascading feedback framework. In the score_vector case, the feedback is a vector of scores, one for every item in the slots. In the cascading case, if the kth item was clicked, then the items up to k-1 receive a score of -1, the kth item receives a score based on a feedback value, and the rest of the items receive a score of 0. The task of the agent is to train the scoring network to estimate these scores.
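The cascading rule described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the function name `cascading_scores_sketch`, the 0-based slot indexing, and the `non_click_score` parameter are assumptions made for illustration only.

```python
from typing import List, Sequence


def cascading_scores_sketch(
    chosen_index: Sequence[int],
    chosen_value: Sequence[float],
    num_slots: int,
    non_click_score: float = -1.0,
) -> List[List[float]]:
    """Hypothetical helper: builds per-slot scores for a batch of lists.

    Assumes 0-based slot indices. Slots before the clicked slot get
    `non_click_score`, the clicked slot gets its feedback value, and
    the slots after it get 0.
    """
    scores = []
    for idx, value in zip(chosen_index, chosen_value):
        row = []
        for slot in range(num_slots):
            if slot < idx:
                row.append(non_click_score)  # skipped before the click
            elif slot == idx:
                row.append(float(value))     # the clicked item
            else:
                row.append(0.0)              # after the click
        scores.append(row)
    return scores


# Two recommendation lists with four slots each: the first was clicked
# at slot 2 with feedback value 5.0, the second at slot 0 with value 1.0.
example_scores = cascading_scores_sketch([2, 0], [5.0, 1.0], num_slots=4)
```

In this sketch, an index equal to or larger than `num_slots` gives every slot the non-click score. Whether the library encodes "no click" that way is not stated here, so treat it as an assumption.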

The observation the agent ingests contains the global features and the features of the items in the recommendation slots. The item features are stored in the per_arm part of the observation, in the order in which the items were recommended. Since this ordered list of items already expresses what action was taken by the policy, the action value of the trajectory is not used by the agent.

Note the difference between the per-arm part of the observation received by the policy and the one received by the agent: while the agent receives the items in the recommendation slots (as explained above), the policy receives the items that are available for recommendation. The user is responsible for converting the observation to the format required by the agent.
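One possible shape for that conversion is sketched below with plain Python dictionaries. The key names `"global"` and `"per_arm"`, the helper `to_agent_observation_sketch`, and the use of a list of ranked item indices are all assumptions for illustration. A real pipeline would operate on batched tensors and follow the agent's actual observation spec.

```python
from typing import Any, Dict, Sequence

GLOBAL_KEY = "global"    # assumed key for the global features
PER_ARM_KEY = "per_arm"  # assumed key for the per-item features


def to_agent_observation_sketch(
    policy_observation: Dict[str, Any],
    ranked_indices: Sequence[int],
) -> Dict[str, Any]:
    """Hypothetical helper: reorders available items into slot order.

    `policy_observation[PER_ARM_KEY]` holds the features of all items
    available for recommendation. `ranked_indices` lists, slot by slot,
    which of those items were recommended.
    """
    available_items = policy_observation[PER_ARM_KEY]
    return {
        GLOBAL_KEY: policy_observation[GLOBAL_KEY],
        # Keep only the recommended items, in recommendation order.
        PER_ARM_KEY: [available_items[i] for i in ranked_indices],
    }


# Three items are available; the policy recommended item 2, then item 0.
policy_observation = {
    GLOBAL_KEY: [0.5, 0.1],
    PER_ARM_KEY: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}
agent_observation = to_agent_observation_sketch(policy_observation, [2, 0])
```

The resulting `per_arm` entry contains only the recommended items in slot order. This is why, as described above, the agent does not need the action value of the trajectory.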


class FeedbackModel: Enumeration of feedback models.

class RankingAgent: Ranking agent class.

class RankingPolicyType: Enumeration of ranking policy types.


compute_score_tensor_for_cascading(...): Gives scores for all items in a batch.

CHOSEN_INDEX: 'chosen_index'

CHOSEN_VALUE: 'chosen_value'