Ranking Python Bandit environment with items as per-arm features.
The observations are drawn with the help of the arguments
The user is modeled the following way: the score of an item is calculated as a weighted inner product of the global feature and the item feature. These scores for all elements of a recommendation are treated as unnormalized logits for a categorical distribution.
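The user model above can be sketched in plain Python. This is a minimal illustration, not the environment's actual implementation: it assumes that "weighted inner product" means a bilinear form g^T W x with a weight matrix W, and the names `item_scores`, `softmax`, `sample_click`, and `weights` are hypothetical.

```python
import math
import random


def item_scores(global_feature, item_features, weights):
    """Score each item as g^T W x (assumed form of the weighted inner product).

    `weights` is a hypothetical global_dim x item_dim matrix.
    """
    scores = []
    for item in item_features:
        # (W x)_i for each global dimension i.
        wx = [sum(w * x for w, x in zip(row, item)) for row in weights]
        scores.append(sum(g * v for g, v in zip(global_feature, wx)))
    return scores


def softmax(logits):
    """Turn unnormalized logits into categorical probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_click(scores, rng):
    """Sample the index of the clicked item from the categorical distribution."""
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


global_feature = [1.0, 0.0]
item_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [[2.0, 0.0], [0.0, 2.0]]

scores = item_scores(global_feature, item_features, weights)
probs = softmax(scores)
clicked = sample_click(scores, random.Random(0))
```

Items whose features align with the global feature get higher logits and are therefore clicked more often, but any item in the recommendation can be chosen.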
To model diversity and no-click behavior, choose one of the following options:
--Do the following trick: every action (a list of recommended items) gets
item_dim extra "ghost actions", represented by unit vectors as item
features. If, based on the inner products over all items in the
recommendation, the environment's user model chooses one of these ghost
items, there was no suitable candidate in the neighborhood, and thus the
user did not click on any of the real items. This is related to diversity:
had the item feature space been covered better, the ghost items would have
been selected with very low probability.
--Calculate the scores of all items, and if none of them exceeds a given
threshold, no item is selected by the user.
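The two no-click options can be sketched as follows. This is an illustration under stated assumptions rather than the environment's code: the score is assumed to be the bilinear form g^T W x, all function and parameter names are hypothetical, and in the threshold variant the choice among items once some score exceeds the threshold (sampling from the categorical distribution over the scores) is an assumption, since the description above only specifies the no-click case.

```python
import math
import random


def score(global_feature, item, weights):
    # Assumed bilinear form g^T W x; `weights` is a hypothetical
    # global_dim x item_dim matrix.
    return sum(g * w * x
               for g, row in zip(global_feature, weights)
               for w, x in zip(row, item))


def sample_from_logits(logits, rng):
    m = max(logits)
    # Unnormalized probabilities; random.choices accepts relative weights.
    rel = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=rel, k=1)[0]


def ghost_action_click(global_feature, items, weights, rng):
    """Option 1: append item_dim unit-vector ghost items; a ghost means no click."""
    item_dim = len(items[0])
    ghosts = [[1.0 if j == i else 0.0 for j in range(item_dim)]
              for i in range(item_dim)]
    logits = [score(global_feature, x, weights) for x in items + ghosts]
    choice = sample_from_logits(logits, rng)
    return choice if choice < len(items) else None  # None = no click


def threshold_click(global_feature, items, weights, threshold, rng):
    """Option 2: no click if no score exceeds the threshold."""
    logits = [score(global_feature, x, weights) for x in items]
    if max(logits) <= threshold:
        return None  # None = no click
    # Assumption: otherwise sample among the real items as usual.
    return sample_from_logits(logits, rng)


g = [1.0, 0.0]
W = [[2.0, 0.0], [0.0, 2.0]]
items = [[1.0, 0.0], [0.0, 1.0]]

no_click = threshold_click(g, items, W, threshold=5.0, rng=random.Random(0))
some_click = threshold_click(g, items, W, threshold=1.0, rng=random.Random(0))
ghost_result = ghost_action_click(g, items, W, rng=random.Random(0))
```

In the ghost variant, poorly matching recommendations leave most of the probability mass on the ghost items, so no-click becomes likely without any explicit threshold parameter.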
class ClickModel: Enumeration of user click models.
class FeedbackModel: Enumeration of feedback models.
class RankingPyEnvironment: Stationary Stochastic Bandit environment with per-arm features.