
TensorFlow Datasets

TFDS provides a collection of ready-to-use datasets for use with TensorFlow, JAX, and other machine learning frameworks.

It handles downloading and preparing the data deterministically and constructing a tf.data.Dataset (or np.array).



TFDS exists in two packages:

  • pip install tensorflow-datasets: The stable version, released every few months.
  • pip install tfds-nightly: Released every day, contains the latest versions of the datasets.

This colab uses tfds-nightly:

pip install -q tfds-nightly tensorflow matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

Find available datasets

All dataset builders are subclasses of tfds.core.DatasetBuilder. To get the list of available builders, use tfds.list_builders() or look at our catalog.


Load a dataset


The easiest way of loading a dataset is tfds.load. It will:

  1. Download the data and save it as tfrecord files.
  2. Load the tfrecord files and create the tf.data.Dataset.

ds = tfds.load('mnist', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)
<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

Some common arguments:

  • split=: Which split to read (e.g. 'train', ['train', 'test'], 'train[80%:]',...). See our split API guide.
  • shuffle_files=: Control whether to shuffle the files between each epoch (TFDS stores big datasets in multiple smaller files).
  • data_dir=: Location where the dataset is saved (defaults to ~/tensorflow_datasets/).
  • with_info=True: Returns the tfds.core.DatasetInfo containing dataset metadata.
  • download=False: Disable download.
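The percent syntax accepted by split= (e.g. 'train[80%:]' or 'train[15%:75%]') maps percentage boundaries onto example indices. The helper below is an illustrative sketch of that mapping only, not TFDS's implementation (TFDS resolves slices against its own split metadata):

```python
# Illustrative sketch: map a percent slice like 'train[15%:75%]' onto
# (skip, take) example counts for a split of known size. NOT TFDS code.

def percent_slice(num_examples, start_pct, stop_pct):
    """Return (skip, take) for split[start_pct%:stop_pct%]."""
    start = int(num_examples * start_pct / 100)
    stop = int(num_examples * stop_pct / 100)
    return start, stop - start

# MNIST train has 60000 examples, so 'train[15%:75%]' covers:
skip, take = percent_slice(60000, 15, 75)
print(skip, take)  # 9000 36000
```

For example, 'train[80%:]' on the 60000-example MNIST train split would skip the first 48000 examples and take the remaining 12000.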


tfds.load is a thin wrapper around tfds.core.DatasetBuilder. You can get the same output using the tfds.core.DatasetBuilder API:

builder = tfds.builder('mnist')
# 1. Create the tfrecord files (no-op if they already exist)
builder.download_and_prepare()
# 2. Load the `tf.data.Dataset`
ds = builder.as_dataset(split='train', shuffle_files=True)
print(ds)
<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

tfds build CLI

If you want to generate a specific dataset, you can use the tfds command line. For example:

tfds build mnist

See the doc for available flags.

Iterate over a dataset

As dict

By default, the tf.data.Dataset object contains a dict of tf.Tensors:

ds = tfds.load('mnist', split='train')
ds = ds.take(1)  # Only take a single example

for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
  print(list(example.keys()))
  image = example["image"]
  label = example["label"]
  print(image.shape, label)
['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)

To find out the dict key names and structure, look at the dataset documentation in our catalog. For example: mnist documentation.

As tuple (as_supervised=True)

By using as_supervised=True, you can get a tuple (features, label) instead for supervised datasets.

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in ds:  # example is (image, label)
  print(image.shape, label)
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)

As numpy (tfds.as_numpy)

Use tfds.as_numpy to convert:

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in tfds.as_numpy(ds):
  print(type(image), type(label), label)
<class 'numpy.ndarray'> <class 'numpy.int64'> 4

As batched tf.Tensor (batch_size=-1)

By using batch_size=-1, you can load the full dataset in a single batch.

This can be combined with as_supervised=True and tfds.as_numpy to get the data as (np.array, np.array):

image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='test',
    batch_size=-1,
    as_supervised=True,
))

print(type(image), image.shape)
<class 'numpy.ndarray'> (10000, 28, 28, 1)

Be careful that your dataset can fit in memory, and that all examples have the same shape.
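One way to check this in advance is to estimate the in-memory footprint from the example shape and dtype. The helper below is a rough back-of-the-envelope sketch (the real footprint also includes framework overhead):

```python
# Rough in-memory size estimate for loading a full split as one batch
# with batch_size=-1. Assumes every example has the same fixed shape.

def full_batch_bytes(num_examples, example_shape, dtype_bytes):
    """Bytes needed for a (num_examples, *example_shape) array."""
    n = dtype_bytes
    for dim in example_shape:
        n *= dim
    return num_examples * n

# MNIST test split: 10000 uint8 images of shape (28, 28, 1).
size = full_batch_bytes(10_000, (28, 28, 1), dtype_bytes=1)
print(f"{size / 2**20:.1f} MiB")  # 7.5 MiB
```

MNIST comfortably fits in memory; for larger datasets (e.g. ImageNet-scale images) this estimate quickly reaches hundreds of gigabytes, in which case batch_size=-1 should be avoided.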

Benchmark your datasets

Benchmarking a dataset is a simple tfds.benchmark call on any iterable (e.g., tfds.as_numpy,...).

ds = tfds.load('mnist', split='train')
ds = ds.batch(32).prefetch(1)

tfds.benchmark(ds, batch_size=32)
tfds.benchmark(ds, batch_size=32)  # Second epoch much faster due to auto-caching

************ Summary ************

Examples/sec (First included) 47889.92 ex/sec (total: 60000 ex, 1.25 sec)
Examples/sec (First only) 110.24 ex/sec (total: 32 ex, 0.29 sec)
Examples/sec (First excluded) 62298.08 ex/sec (total: 59968 ex, 0.96 sec)

************ Summary ************

Examples/sec (First included) 290380.50 ex/sec (total: 60000 ex, 0.21 sec)
Examples/sec (First only) 2506.57 ex/sec (total: 32 ex, 0.01 sec)
Examples/sec (First excluded) 309338.21 ex/sec (total: 59968 ex, 0.19 sec)

  • Do not forget to normalize the results per batch size with the batch_size= kwarg.
  • In the summary, the first warmup batch is separated from the other ones to capture extra setup time (e.g. buffers initialization,...).
  • Notice how the second iteration is much faster due to TFDS auto-caching.
  • tfds.benchmark returns a tfds.core.BenchmarkResult which can be inspected for further analysis.
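Conceptually, a benchmark like this just times iteration over the iterable and reports the first (warmup) batch separately from the rest. The stdlib-only sketch below illustrates the idea; it is not the actual tfds.benchmark implementation, and the dict keys are made up for this example:

```python
import time

# Simplified sketch of dataset benchmarking: iterate once, timing the
# first (warmup) batch separately to capture setup cost.
def benchmark(iterable, batch_size=1):
    start = time.perf_counter()
    first_end = None
    num_batches = 0
    for _ in iterable:
        num_batches += 1
        if first_end is None:
            first_end = time.perf_counter()  # end of warmup batch
    total = time.perf_counter() - start
    num_examples = num_batches * batch_size
    return {
        'total_examples': num_examples,
        'total_sec': total,
        'ex_per_sec_incl_first': num_examples / total,
        'first_batch_sec': first_end - start,
    }

# Any iterable works, e.g. a batched dataset; here a stand-in iterable:
stats = benchmark(range(100), batch_size=32)
print(stats['total_examples'])  # 3200
```

Passing batch_size= is what lets the examples/sec numbers be normalized per example rather than per batch, as in the summaries above.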

Build end-to-end pipeline

To go further, you can look at our end-to-end examples and performance guides.


Visualization

tfds.as_dataframe

tf.data.Dataset objects can be converted to pandas.DataFrame with tfds.as_dataframe to be visualized on Colab.

  • Add the tfds.core.DatasetInfo as second argument of tfds.as_dataframe to visualize images, audio, texts, videos,...
  • Use ds.take(x) to only display the first x examples. pandas.DataFrame will load the full dataset in-memory, and can be very expensive to display.
ds, info = tfds.load('mnist', split='train', with_info=True)

tfds.as_dataframe(ds.take(4), info)


tfds.show_examples

tfds.show_examples returns a matplotlib.figure.Figure (only image datasets are supported for now):

ds, info = tfds.load('mnist', split='train', with_info=True)

fig = tfds.show_examples(ds, info)


Access the dataset metadata

All builders include a tfds.core.DatasetInfo object containing the dataset metadata.

It can be accessed through:

ds, info = tfds.load('mnist', with_info=True)
# or
builder = tfds.builder('mnist')
info = builder.info

The dataset info contains additional information about the dataset (version, citation, homepage, description,...).

print(info)

tfds.core.DatasetInfo(
    name='mnist',
    description="""
    The MNIST database of handwritten digits.
    """,
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available:},
    }""",
)

Features metadata (label names, image shape,...)

Access the tfds.features.FeaturesDict:

print(info.features)

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Number of classes, label names:

print(info.features["label"].num_classes)
print(info.features["label"].names)
print(info.features["label"].int2str(7))  # Human readable version (8 -> 'cat')

10
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
7

Shapes, dtypes:

print(info.features.shape)
print(info.features.dtype)
print(info.features['image'].shape)
print(info.features['image'].dtype)

{'image': (28, 28, 1), 'label': ()}
{'image': tf.uint8, 'label': tf.int64}
(28, 28, 1)
<dtype: 'uint8'>

Split metadata (e.g. split names, number of examples,...)

Access the tfds.core.SplitDict:

print(info.splits)

{'test': <SplitInfo num_examples=10000, num_shards=1>, 'train': <SplitInfo num_examples=60000, num_shards=1>}

Available splits:

print(list(info.splits.keys()))

['test', 'train']

Get info on an individual split:

print(info.splits['train'].num_examples)
print(info.splits['train'].num_shards)

60000
1

It also works with the subsplit API:

print(info.splits['train[15%:75%]'].num_examples)
print(info.splits['train[15%:75%]'].file_instructions)

36000
[FileInstruction(filename='mnist-train.tfrecord-00000-of-00001', skip=9000, take=36000, num_examples=36000)]


Manual download (if download fails)

If download fails for some reason (e.g. offline), you can always manually download the data yourself and place it in the manual_dir (defaults to ~/tensorflow_datasets/download/manual/).

To find out which URLs to download, look into the dataset's checksums file in the TFDS repository.

Fixing NonMatchingChecksumError

TFDS ensures determinism by validating the checksums of downloaded URLs. If NonMatchingChecksumError is raised, it might indicate:

  • The website may be down (e.g. 503 status code). Please check the URL.
  • For Google Drive URLs, try again later, as Drive sometimes rejects downloads when too many people access the same URL. See bug
  • The original dataset files may have been updated. In this case the TFDS dataset builder should be updated. Please open a new GitHub issue or PR:
    • Register the new checksums with tfds build --register_checksums
    • Eventually update the dataset generation code.
    • Update the dataset VERSION
    • Update the dataset RELEASE_NOTES: What caused the checksums to change? Did some examples change?
    • Make sure the dataset can still be built.
    • Send us a PR
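The validation that raises NonMatchingChecksumError can be pictured with a small stdlib sketch. This is a hypothetical helper, not TFDS's actual code (TFDS keeps a registry of expected checksums per URL in checksums files):

```python
import hashlib

# Sketch of checksum validation for a downloaded file. Illustrative only.
def validate_download(content: bytes, expected_sha256: str) -> None:
    """Raise if the downloaded bytes do not match the registered checksum."""
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"NonMatchingChecksum: expected {expected_sha256}, got {actual}. "
            "The file may have changed upstream or the download was corrupted."
        )

# A matching checksum passes silently:
validate_download(
    b"hello",
    "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824",
)
```

This is why the error is a strong signal that either the download was corrupted or the upstream files changed, and why updating a dataset requires re-registering its checksums.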


Citation

If you're using tensorflow-datasets for a paper, please include the following citation, in addition to any citation specific to the used datasets (which can be found in the dataset catalog).

@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{} },
}