

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

  • tfds.DatasetBuilder.info: documents the dataset, including feature names, types, and shapes, version, splits, citation, etc.
  • tfds.DatasetBuilder.download_and_prepare: downloads the source data and writes it to disk.
  • tfds.DatasetBuilder.as_dataset: builds an input pipeline using tf.data.Datasets.

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a tfds.core.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in tfds.DatasetBuilder.builder_configs.
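
The configuration pattern described above can be illustrated with a toy sketch in plain Python. The classes below (ToyConfig, ToyConfigurableBuilder) are hypothetical stand-ins, not the TFDS implementation; they only show how a builder might accept either a config object or a config name, and expose its predefined variants by name. Treating the first config as the default is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for a BuilderConfig subclass."""
    name: str
    image_size: int


class ToyConfigurableBuilder:
    """Hypothetical builder exposing two predefined variants."""

    BUILDER_CONFIGS = [
        ToyConfig(name="small", image_size=32),
        ToyConfig(name="large", image_size=256),
    ]

    def __init__(self, *, config: Optional[Union[str, ToyConfig]] = None):
        if config is None:
            # Assumption for this sketch: the first config is the default.
            config = self.BUILDER_CONFIGS[0]
        elif isinstance(config, str):
            # Look the name up among the predefined configs.
            # An unknown name raises KeyError.
            config = self.builder_configs[config]
        self.builder_config = config

    @property
    def builder_configs(self) -> Dict[str, ToyConfig]:
        """Predefined configurations, keyed by name."""
        return {c.name: c for c in self.BUILDER_CONFIGS}
```

With this sketch, ToyConfigurableBuilder(config="large") and ToyConfigurableBuilder(config=ToyConfig("large", 256)) end up with equal builder_config values, mirroring the "config object (or name)" wording above.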

Typical DatasetBuilder usage:

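import tensorflow as tf
import tensorflow_datasets as tfds

mnist_builder = tfds.builder("mnist")
mnist_info = mnist_builder.info
# Download the source data and write it to disk before reading it
mnist_builder.download_and_prepare()
datasets = mnist_builder.as_dataset()

train_dataset, test_dataset = datasets["train"], datasets["test"]
assert isinstance(train_dataset, tf.data.Dataset)

# And then the rest of your input pipeline
train_dataset = train_dataset.repeat().shuffle(1024).batch(128)
train_dataset = train_dataset.prefetch(2)
# Pull one batch (assumes eager execution, the TF2 default)
features = next(iter(train_dataset))
image, label = features['image'], features['label']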

data_dir directory to read/write data. Defaults to the value of the environment variable TFDS_DATA_DIR, if set, otherwise falls back to "~/tensorflow_datasets".
config tfds.core.BuilderConfig or str name, optional configuration for the dataset that affects the data generated on disk. Different builder_configs will have their own subdirectories and versions.
version Optional version at which to load the dataset. An error is raised if the specified version cannot be satisfied. E.g. '1.2.3', '1.2.*'. The special value "experimental_latest" will use the highest version, even if it is not the default. This is not recommended unless you know what you are doing, as the version could be broken.
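
The fallback order for data_dir and the wildcard matching for version can be sketched in plain Python. Both helpers below (resolve_data_dir, pick_version) are hypothetical names introduced for illustration and are a simplified reading of the argument descriptions above, not the library's actual resolution code. In particular, the sketch assumes versions are dot-separated integers and that the available versions are listed in preference order.

```python
import os
from typing import List, Mapping, Optional


def resolve_data_dir(data_dir: Optional[str] = None,
                     environ: Optional[Mapping[str, str]] = None) -> str:
    """Explicit argument first, then TFDS_DATA_DIR, then the home default."""
    environ = os.environ if environ is None else environ
    chosen = data_dir or environ.get("TFDS_DATA_DIR") or "~/tensorflow_datasets"
    return os.path.expanduser(chosen)


def pick_version(requested: str, available: List[str]) -> str:
    """Returns the first available version that satisfies `requested`.

    A `*` component matches anything, so '1.2.*' matches '1.2.3'.
    """
    if requested == "experimental_latest":
        # Highest version wins, compared numerically component by component.
        return max(available, key=lambda v: tuple(int(p) for p in v.split(".")))
    wanted = requested.split(".")
    for candidate in available:
        parts = candidate.split(".")
        if len(parts) == len(wanted) and all(
                w in ("*", p) for w, p in zip(wanted, parts)):
            return candidate
    raise ValueError(f"No available version satisfies {requested!r}")
```

For example, pick_version("1.2.*", ["2.0.0", "1.2.3"]) selects "1.2.3", while a request that nothing matches raises an error, in line with the description of version.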

builder_config tfds.core.BuilderConfig for this builder.



info tfds.core.DatasetInfo for this builder.



versions Versions (canonical + available), in preference order.




Constructs a tf.data.Dataset.

Callers must pass arguments as keyword arguments.

The output types vary depending on the parameters. Examples:

builder = tfds.builder('imdb_reviews')

# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}

# With as_supervised: only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

split Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]',...). See our split API guide. If None, will return all splits in a Dict[Split, tf.data.Dataset].
batch_size int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users who want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
shuffle_files bool, whether to shuffle the input files. Defaults to False.
decoders Nested dict of Decoder objects that customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
read_config tfds.ReadConfig, additional options to configure the input pipeline (e.g. seed, num parallel reads,...).
as_supervised bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.
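
Two of the arguments above change the shape of each element: batch_size zero-pads variable-length features, and as_supervised turns a feature dictionary into an (input, label) tuple. The plain-Python sketch below illustrates both ideas on ordinary lists and dicts. The helper names (pad_batch, to_supervised) are hypothetical and this is not how the library implements them; in a real pipeline, tf.data.Dataset.padded_batch offers comparable padding behavior.

```python
from typing import Any, Dict, List, Tuple


def pad_batch(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pads every sequence to the length of the longest one."""
    longest = max((len(s) for s in sequences), default=0)
    return [s + [pad_value] * (longest - len(s)) for s in sequences]


def to_supervised(example: Dict[str, Any],
                  supervised_keys: Tuple[str, str]) -> Tuple[Any, Any]:
    """Maps a feature dictionary to an (input, label) 2-tuple."""
    input_key, label_key = supervised_keys
    return example[input_key], example[label_key]


print(pad_batch([[5, 3], [7], [1, 2, 4]]))
# [[5, 3, 0], [7, 0, 0], [1, 2, 4]]
print(to_supervised({"text": b"great film", "label": 1}, ("text", "label")))
# (b'great film', 1)
```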

Returns tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>.

If batch_size is -1, will return feature dictionaries containing the entire dataset in tf.Tensors instead of a tf.data.Dataset.



Downloads and prepares dataset for reading.

download_dir str, directory where downloaded files are stored. Defaults to "~/tensorflow-datasets/downloads".
download_config tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.

IOError if there is not enough disk space available.
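
The disk-space failure mode can be pictured with a small pre-flight check. This is a sketch using only the standard library; ensure_free_space is a hypothetical helper, and how the library itself estimates the required size and performs the check is not shown here.

```python
import shutil


def ensure_free_space(path: str, required_bytes: int) -> None:
    """Raises IOError if `path` has less than `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    if free_bytes < required_bytes:
        raise IOError(
            f"Not enough disk space: need {required_bytes} bytes, "
            f"only {free_bytes} free at {path!r}.")
```

A caller that wants to fail early could run a check like this on the target directory before starting a large download, and catch IOError to report the problem.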



Returns the tfds.core.DatasetInfo object.

This function is called once and the result is cached for all following calls.

dataset_info The dataset metadata.
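
The "called once, then cached" behavior described above can be sketched with functools.cached_property. The class below (CachedInfoBuilder) is a hypothetical illustration and makes no claim about how the library caches its metadata internally; the dictionary returned here merely stands in for a metadata object.

```python
from functools import cached_property


class CachedInfoBuilder:
    """Hypothetical builder whose metadata is computed only once."""

    def __init__(self):
        self.info_calls = 0

    def _build_info(self) -> dict:
        # Stands in for an expensive metadata construction step.
        self.info_calls += 1
        return {"name": "toy", "version": "1.0.0"}

    @cached_property
    def info(self) -> dict:
        # The first access runs _build_info; later accesses reuse the result.
        return self._build_info()


builder = CachedInfoBuilder()
first, second = builder.info, builder.info
print(first is second, builder.info_calls)  # True 1
```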







code_path Instance of tensorflow_datasets.core.utils.gpath.PosixGPath
name 'dataset_builder'
url_infos None