
criteo

Description:

Criteo Uplift Modeling Dataset

This dataset is released along with the paper “A Large Scale Benchmark for Uplift Modeling” by Eustache Diemert, Artem Betlei, Christophe Renaudin (Criteo AI Lab) and Massih-Reza Amini (LIG, Grenoble INP).

This work was published in: AdKDD 2018 Workshop, in conjunction with KDD 2018.

Data description

This dataset is constructed by assembling data from several incrementality tests, a randomized trial procedure in which a random part of the population is prevented from being targeted by advertising. It consists of 25M rows, each one representing a user with 12 features (f0 through f11), a treatment indicator and 2 labels (visits and conversions).

Fields

Here is a detailed description of the fields (they are comma-separated in the file):

f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
treatment: treatment group (1 = treated, 0 = control)
conversion: whether a conversion occurred for this user (binary, label)
visit: whether a visit occurred for this user (binary, label)
exposure: treatment effect, whether the user has been effectively exposed (binary)
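
As a minimal sketch of how such comma-separated rows could be turned into typed records, the snippet below parses a small synthetic sample with Python's standard `csv` module. The header line, the column order and the sample values are assumptions made for illustration; they are not taken from the real file.

```python
import csv
import io

FEATURE_NAMES = [f"f{i}" for i in range(12)]  # f0 .. f11, dense floats
BINARY_NAMES = ["treatment", "conversion", "visit", "exposure"]


def parse_row(row):
    """Convert one csv.DictReader row (all strings) into typed values."""
    record = {name: float(row[name]) for name in FEATURE_NAMES}
    # int(float(...)) accepts both "1" and "1.0" style encodings.
    record.update({name: int(float(row[name])) for name in BINARY_NAMES})
    return record


# Synthetic sample: the header layout and values are hypothetical.
header = ",".join(FEATURE_NAMES + BINARY_NAMES)
row_a = ",".join(["0.5"] * 12 + ["1", "0", "1", "1"])
row_b = ",".join(["1.25"] * 12 + ["0", "0", "0", "0"])
sample = "\n".join([header, row_a, row_b]) + "\n"

records = [parse_row(r) for r in csv.DictReader(io.StringIO(sample))]
```

Each record is a plain dict keyed by field name, so downstream code can refer to `record["treatment"]` or `record["visit"]` directly.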

Key figures

Format: CSV
Size: 459MB (compressed)
Rows: 25,309,483
Average Visit Rate: .04132
Average Conversion Rate: .00229
Treatment Ratio: .846

Tasks

The dataset was collected and prepared with uplift prediction in mind as the main task. Additionally, we can foresee related uses such as, but not limited to:

benchmark for causal inference
uplift modeling
interactions between features and treatment
heterogeneity of treatment
benchmark for observational causality methods
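
Because the treatment is assigned at random, the simplest uplift quantity is the difference in mean outcome between the treated and control groups. The sketch below computes that average uplift on a toy list of records; the function name `average_uplift` and the toy data are hypothetical, and this is a population-level estimate rather than the per-user uplift model the benchmark targets.

```python
def average_uplift(records, outcome="visit"):
    """Mean outcome among treated users minus mean outcome among control users."""
    treated = [r[outcome] for r in records if r["treatment"] == 1]
    control = [r[outcome] for r in records if r["treatment"] == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control record")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data (hypothetical): 2 of 4 treated users visit, 1 of 4 control users visits.
toy = [
    {"treatment": 1, "visit": 1},
    {"treatment": 1, "visit": 0},
    {"treatment": 1, "visit": 1},
    {"treatment": 1, "visit": 0},
    {"treatment": 0, "visit": 0},
    {"treatment": 0, "visit": 1},
    {"treatment": 0, "visit": 0},
    {"treatment": 0, "visit": 0},
]
uplift = average_uplift(toy)  # 0.5 - 0.25
```

Passing `outcome="conversion"` would give the same estimate for the conversion label, provided the records carry that field.
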
Additional Documentation: Explore on Papers With Code
Homepage: https://ailab.criteo.com/criteo-uplift-prediction-dataset/
Source code: tfds.recommendation.criteo.Criteo
Versions:
- 1.0.0: Initial release.
- 1.0.1 (default): Fixed parsing of fields conversion, visit and exposure.
Download size: 297.00 MiB
Dataset size: 3.55 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	13,979,592
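
A minimal sketch of loading this split with `tfds.load`, assuming `tensorflow_datasets` and TensorFlow are installed. The helper names `load_criteo_train` and `peek` are hypothetical; the first call downloads and prepares the data, so neither function is invoked here.

```python
def load_criteo_train(version="1.0.1"):
    """Return the 'train' split as a tf.data.Dataset (downloads on first use)."""
    # Imported lazily because it is a heavy, optional dependency.
    import tensorflow_datasets as tfds

    return tfds.load(f"criteo:{version}", split="train")


def peek(n=1):
    """Print the treatment flag and visit label of the first n examples."""
    for example in load_criteo_train().take(n):
        # Each example is a dict of tensors keyed by feature name.
        print(example["treatment"].numpy(), example["visit"].numpy())
```

Calling `peek()` triggers the download and prints one example. Pinning the version in the dataset name keeps results reproducible if the default version changes later.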

Feature structure:

FeaturesDict({
    'conversion': bool,
    'exposure': bool,
    'f0': float32,
    'f1': float32,
    'f10': float32,
    'f11': float32,
    'f2': float32,
    'f3': float32,
    'f4': float32,
    'f5': float32,
    'f6': float32,
    'f7': float32,
    'f8': float32,
    'f9': float32,
    'treatment': int64,
    'visit': bool,
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
conversion	Tensor	bool
exposure	Tensor	bool
f0	Tensor	float32
f1	Tensor	float32
f10	Tensor	float32
f11	Tensor	float32
f2	Tensor	float32
f3	Tensor	float32
f4	Tensor	float32
f5	Tensor	float32
f6	Tensor	float32
f7	Tensor	float32
f8	Tensor	float32
f9	Tensor	float32
treatment	Tensor	int64
visit	Tensor	bool

Supervised keys (See as_supervised doc): ({'exposure': 'exposure', 'f0': 'f0', 'f1': 'f1', 'f10': 'f10', 'f11': 'f11', 'f2': 'f2', 'f3': 'f3', 'f4': 'f4', 'f5': 'f5', 'f6': 'f6', 'f7': 'f7', 'f8': 'f8', 'f9': 'f9', 'treatment': 'treatment'}, 'visit')
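
To make the supervised keys above concrete: with `as_supervised=True`, each example is delivered as an `(inputs, label)` pair in which `visit` is the label and `conversion` is not part of the inputs. The plain-Python sketch below mimics that split on a hand-written example dict; `to_supervised` is a hypothetical helper for illustration, not TFDS code.

```python
LABEL_KEY = "visit"
INPUT_KEYS = ["exposure", "treatment"] + [f"f{i}" for i in range(12)]


def to_supervised(example):
    """Split a full example dict into (inputs, label) per the supervised keys."""
    inputs = {key: example[key] for key in INPUT_KEYS}
    return inputs, example[LABEL_KEY]


# Hand-written example with hypothetical values.
example = {f"f{i}": 0.0 for i in range(12)}
example.update({"treatment": 1, "exposure": True, "visit": True, "conversion": False})

inputs, label = to_supervised(example)
```

Here `inputs` holds the 12 features plus `treatment` and `exposure`, and `label` is the `visit` flag.
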
Figure (tfds.show_examples): Not supported.

Citation:

@inproceedings{Diemert2018,
author = {Diemert, Eustache and Betlei, Artem and Renaudin, Christophe and Amini, Massih-Reza},
title={A Large Scale Benchmark for Uplift Modeling},
publisher = {ACM},
booktitle = {Proceedings of the AdKDD and TargetAd Workshop, KDD, London, United Kingdom, August 20, 2018},
year = {2018}
}