Adding a dataset

Follow this guide to add a dataset to TFDS.

Check our list of datasets to see whether the dataset you want has already been added.

Overview

Datasets are distributed in all kinds of formats and in all kinds of places, and they're not always stored in a format that's ready to feed into a machine learning pipeline. Enter TFDS.

TFDS provides a way to transform all those datasets into a standard format, do the preprocessing necessary to make them ready for a machine learning pipeline, and provide a standard input pipeline using tf.data.

To enable this, each dataset implements a subclass of DatasetBuilder, which specifies:

  • Where the data is coming from (i.e. its URL);
  • What the dataset looks like (i.e. its features);
  • How the data should be split (e.g. TRAIN and TEST);
  • and the individual records in the dataset.

The first time a dataset is used, the dataset is downloaded, prepared, and written to disk in a standard format. Subsequent access will read from those pre-processed files directly.
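
For example, using the standard tfds.load entry point (the dataset name here is just illustrative):

import tensorflow_datasets as tfds

# The first call downloads and prepares the data under ~/tensorflow_datasets;
# subsequent calls read the prepared files directly.
ds = tfds.load("mnist", split=tfds.Split.TRAIN)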

Note: Currently we do not support datasets that take longer than 1 day to generate on a single machine. See the section below on large datasets.

Writing my_dataset.py

DatasetBuilder

Each dataset is defined as a subclass of tfds.core.DatasetBuilder implementing the following methods:

  • _info: builds the DatasetInfo object describing the dataset
  • _download_and_prepare: to download and serialize the source data to disk
  • _as_dataset: to produce a tf.data.Dataset from the serialized data

Most datasets subclass tfds.core.GeneratorBasedBuilder, which is a subclass of tfds.core.DatasetBuilder that simplifies defining a dataset. It works well for datasets that can be generated on a single machine. Its subclasses implement:

  • _info: builds the DatasetInfo object describing the dataset
  • _split_generators: downloads the source data and defines the dataset splits
  • _generate_examples: yields examples in the dataset from the source data

This guide will use GeneratorBasedBuilder.

my_dataset.py

my_dataset.py first looks like this:

import tensorflow_datasets.public_api as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
  """Short description of my dataset."""

  VERSION = tfds.core.Version('0.1.0')

  def _info(self):
    # Specifies the tfds.core.DatasetInfo object
    pass  # TODO

  def _split_generators(self, dl_manager):
    # Downloads the data and defines the splits
    # dl_manager is a tfds.download.DownloadManager that can be used to
    # download and extract URLs
    pass  # TODO

  def _generate_examples(self):
    # Yields examples from the dataset
    pass  # TODO

If you'd like to follow a test-driven development workflow, which can help you iterate faster, jump to the testing instructions below, add the test, and then return here.

Specifying DatasetInfo

DatasetInfo describes the dataset.

class MyDataset(tfds.core.GeneratorBasedBuilder):

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        # This is the description that will appear on the datasets page.
        description=("This is the dataset for xxx. It contains yyy. The "
                     "images are kept at their original dimensions."),
        # tfds.features.FeatureConnectors
        features=tfds.features.FeaturesDict({
            "image_description": tfds.features.Text(),
            "image": tfds.features.Image(),
            # Here, labels can be of 5 distinct values.
            "label": tfds.features.ClassLabel(num_classes=5),
        }),
        # If there's a common (input, target) tuple from the features,
        # specify them here. They'll be used if as_supervised=True in
        # builder.as_dataset.
        supervised_keys=("image", "label"),
        # Homepage of the dataset for documentation
        urls=["https://dataset-homepage.org"],
        # Bibtex citation for the dataset
        citation=r"""@article{my-awesome-dataset-2020,
                              author = {Smith, John},
                     }""",
    )

FeatureConnectors

Each feature is specified in DatasetInfo as a tfds.features.FeatureConnector. FeatureConnectors document each feature, provide shape and type checks, and abstract away serialization to and from disk. There are many feature types already defined and you can also add a new one.

If you've implemented the test harness, test_info should now pass.

Downloading and extracting source data

Most datasets need to download data from the web. All downloads and extractions must go through the tfds.download.DownloadManager. DownloadManager currently supports extracting .zip, .gz, and .tar files.

For example, one can both download and extract URLs with download_and_extract:

def _split_generators(self, dl_manager):
  # Equivalent to dl_manager.extract(dl_manager.download(urls))
  dl_paths = dl_manager.download_and_extract({
      'foo': 'https://example.com/foo.zip',
      'bar': 'https://example.com/bar.zip',
  })
  # dl_paths['foo'] and dl_paths['bar'] now point to the extracted directories

Manual download and extraction

For source data that cannot be automatically downloaded (for example, it may require a login), the user will manually download the source data and place it in manual_dir, which you can access with dl_manager.manual_dir (defaults to ~/tensorflow_datasets/manual/my_dataset).
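
For instance, a sketch of a _split_generators that relies on manually downloaded data (the archive name my_data.zip is a hypothetical placeholder):

def _split_generators(self, dl_manager):
  # The user is expected to have placed my_data.zip in manual_dir beforehand.
  archive_path = os.path.join(dl_manager.manual_dir, "my_data.zip")
  if not tf.io.gfile.exists(archive_path):
    raise AssertionError(
        "You must download my_data.zip manually and place it in %s" %
        dl_manager.manual_dir)
  extracted_path = dl_manager.extract(archive_path)
  ...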

Specifying dataset splits

If the dataset comes with pre-defined splits (for example, MNIST has train and test splits), keep those splits in the DatasetBuilder. If this is your own data and you can decide your own splits, we suggest using a split of (TRAIN: 80%, VALIDATION: 10%, TEST: 10%). Users can always get subsplits through tfds.Split.subsplit.
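
For example, a user could read just part of the train split at load time with the subsplit API (a sketch; the percentage is arbitrary):

# Read only the first 80% of the train split.
train_80 = tfds.load("my_dataset",
                     split=tfds.Split.TRAIN.subsplit(tfds.percent[:80]))

Within the builder itself, the splits are declared in _split_generators: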

  def _split_generators(self, dl_manager):
    # Download source data
    extracted_path = dl_manager.download_and_extract(...)

    # Specify the splits
    return [
        tfds.core.SplitGenerator(
            name="train",
            num_shards=10,
            gen_kwargs={
                "images_dir_path": os.path.join(extracted_path, "train"),
                "labels": os.path.join(extracted_path, "train_labels.csv"),
            },
        ),
        tfds.core.SplitGenerator(
            name="test",
            num_shards=1,
            gen_kwargs={
                "images_dir_path": os.path.join(extracted_path, "test"),
                "labels": os.path.join(extracted_path, "test_labels.csv"),
            },
        ),
    ]

SplitGenerator describes how a split should be generated. gen_kwargs will be passed as keyword arguments to _generate_examples, which we'll define next.

When specifying num_shards, which determines how many files the split will use, pick a number such that a single shard is less than 4 GiB, as each shard will be loaded in memory for shuffling (for example, a split of roughly 30 GiB needs at least 8 shards).

Writing an example generator

_generate_examples generates the examples for each split from the source data. For the TRAIN split with the gen_kwargs defined above, _generate_examples will be called as:

builder._generate_examples(
    images_dir_path="{extracted_path}/train",
    labels="{extracted_path}/train_labels.csv",
)

This method will typically read source dataset artifacts (e.g. a CSV file) and yield feature dictionaries that correspond to the features specified in DatasetInfo.

def _generate_examples(self, images_dir_path, labels):
  # Read the input data out of the source files
  for image_file in tf.io.gfile.listdir(images_dir_path):
    ...
  with tf.io.gfile.GFile(labels) as f:
    ...

  # And yield examples as feature dictionaries
  for image_id, description, label in data:
    yield {
        "image_description": description,
        "image": "%s/%s.jpeg" % (images_dir_path, image_id),
        "label": label,
    }

DatasetInfo.features.encode_example will encode these dictionaries into a format suitable for writing to disk (currently we use tf.train.Example protocol buffers). For example, tfds.features.Image will copy out the JPEG content of the passed image files automatically.

If you've implemented the test harness, your builder test should now pass.

File access and tf.io.gfile

In order to support Cloud storage systems, use tf.io.gfile or other TensorFlow file APIs (for example, tf.python_io) for all filesystem access. Avoid using Python built-ins for file operations (e.g. open, os.rename, gzip, etc.).
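
For example (the paths below are hypothetical placeholders):

# tf.io.gfile works for local paths as well as Cloud storage paths (e.g. gs://...).
with tf.io.gfile.GFile(labels_path) as f:         # instead of open(labels_path)
  lines = f.read().splitlines()
for filename in tf.io.gfile.listdir(images_dir):  # instead of os.listdir(images_dir)
  ...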

Extra dependencies

Some datasets require additional Python dependencies during data generation. For example, the SVHN dataset uses scipy to load some data. In order to keep the tensorflow-datasets package small and allow users to install additional dependencies only as needed, use tfds.core.lazy_imports.

To use lazy_imports:

  • Add an entry for your dataset into DATASET_EXTRAS in setup.py. This allows users to install the extra dependencies only as needed, for example with pip install 'tensorflow-datasets[svhn]'.
  • Add an entry for your import to LazyImporter and to the LazyImportsTest.
  • Use tfds.core.lazy_imports to access the dependency (for example, tfds.core.lazy_imports.scipy) in your DatasetBuilder, as in the sketch below.
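
A rough sketch of the pattern (MyMatlabDataset and its .mat input are hypothetical; the example assumes the scipy lazy import exposes scipy.io):

class MyMatlabDataset(tfds.core.GeneratorBasedBuilder):

  def _generate_examples(self, path):
    # The dependency is only imported when data generation actually runs, so
    # users of the already-prepared dataset never need it installed.
    scipy = tfds.core.lazy_imports.scipy
    with tf.io.gfile.GFile(path, "rb") as f:
      data = scipy.io.loadmat(f)
    ...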

Corrupted data

Some datasets are not perfectly clean and contain some corrupt data (for example, the images are in JPEG files but some are invalid JPEG). These examples should be skipped, but leave a note in the dataset description about how many examples were dropped and why.

Inconsistent data

Some datasets provide a set of URLs for individual records or features (for example, URLs to various images around the web) that may or may not exist anymore. These datasets are difficult to version properly because the source data is unstable (URLs come and go).

If the dataset is inherently unstable (that is, if multiple runs over time may not yield the same data), mark the dataset as unstable by adding a class constant to the DatasetBuilder: UNSTABLE = "<why this dataset is unstable>". For example, UNSTABLE = "Downloads URLs from the web."

Dataset configuration

Some datasets may have variants that should be exposed, or options for how the data is preprocessed. These configurations can be separated into 2 categories:

  1. "Heavy": configuration that affects how the data is written to disk.
  2. "Light": configuration that affects runtime preprocessing (i.e. configuration that can be applied in a tf.data input pipeline).

Heavy configuration with BuilderConfig

Heavy configuration affects how the data is written to disk. For example, for text datasets, different TextEncoders and vocabularies affect the token ids that are written to disk.

Heavy configuration is done through tfds.core.BuilderConfigs:

  1. Define your own configuration object as a subclass of tfds.core.BuilderConfig. For example, MyDatasetConfig.
  2. Define the BUILDER_CONFIGS class member in MyDataset that lists MyDatasetConfigs that the dataset exposes.
  3. Use self.builder_config in MyDataset to configure data generation. This may include setting different values in _info() or changing download data access.

Datasets with BuilderConfigs have a name and version per config, so the fully qualified name of a particular variant would be dataset_name/config_name (for example, "lm1b/bytes"). The config defaults to the first one in BUILDER_CONFIGS (for example "lm1b" defaults to "lm1b/plain_text").

See Lm1b for an example of a dataset that uses BuilderConfigs.
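
A rough sketch of the pattern (the img_size option and the config names are hypothetical):

class MyDatasetConfig(tfds.core.BuilderConfig):

  def __init__(self, img_size=None, **kwargs):
    super(MyDatasetConfig, self).__init__(**kwargs)
    self.img_size = img_size


class MyDataset(tfds.core.GeneratorBasedBuilder):
  BUILDER_CONFIGS = [
      MyDatasetConfig(
          name="small", version="0.1.0",
          description="Images resized to 8x8.", img_size=(8, 8)),
      MyDatasetConfig(
          name="large", version="0.1.0",
          description="Images resized to 32x32.", img_size=(32, 32)),
  ]

  def _info(self):
    # self.builder_config is the MyDatasetConfig selected by the user.
    height, width = self.builder_config.img_size
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "image": tfds.features.Image(shape=(height, width, 3)),
            "label": tfds.features.ClassLabel(num_classes=5),
        }),
    )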

Light configuration with constructor args

For situations where alterations could be made on-the-fly in the tf.data input pipeline, add keyword arguments to the MyDataset constructor, store the values in member variables, and then use them later. For example, override _as_dataset(), call super() to get the base tf.data.Dataset, and then do additional transformations based on the member variables.
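
A minimal sketch of that pattern (the max_text_len option is hypothetical):

class MyDataset(tfds.core.GeneratorBasedBuilder):

  def __init__(self, max_text_len=None, **kwargs):
    super(MyDataset, self).__init__(**kwargs)
    self._max_text_len = max_text_len

  def _as_dataset(self, **kwargs):
    dataset = super(MyDataset, self)._as_dataset(**kwargs)
    if self._max_text_len is None:
      return dataset

    def _truncate(example):
      # Applied at read time in the tf.data pipeline; nothing on disk changes.
      example["image_description"] = tf.strings.substr(
          example["image_description"], 0, self._max_text_len)
      return example

    return dataset.map(_truncate)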

Create your own FeatureConnector

Note that for most datasets the current set of tfds.features.FeatureConnectors is sufficient, but sometimes a new one may need to be defined.

tfds.features.FeatureConnectors in DatasetInfo correspond to the elements returned in the tf.data.Dataset object. For instance, with:

tfds.core.DatasetInfo(features=tfds.features.FeaturesDict({
    'input': tfds.features.Image(),
    'output': tfds.features.Text(encoder=tfds.features.text.ByteTextEncoder()),
    'metadata': {
        'description': tfds.features.Text(),
        'img_id': tf.int32,
    },
}))

The items in the tf.data.Dataset object would look like:

{
    'input': tf.Tensor(shape=(None, None, 3), dtype=tf.uint8),
    'output': tf.Tensor(shape=(None,), dtype=tf.int32),  # Sequence of token ids
    'metadata': {
        'description': tf.Tensor(shape=(), dtype=tf.string),
        'img_id': tf.Tensor(shape=(), dtype=tf.int32),
    },
}

The tfds.features.FeatureConnector object abstracts away how the feature is encoded on disk from how it is presented to the user. Below is a diagram showing the abstraction layers of the dataset and the transformation from the raw dataset files to the tf.data.Dataset object.

DatasetBuilder abstraction layers

To create your own feature connector, subclass tfds.features.FeatureConnector and implement the abstract methods:

  • get_tensor_info(): Indicates the shape/dtype of the tensor(s) returned by tf.data.Dataset
  • encode_example(input_data): Defines how to encode the data yielded by the generator _generate_examples() into tf.train.Example-compatible data
  • decode_example: Defines how to decode the data from the tensor read from tf.train.Example into the user-facing tensor(s) returned by tf.data.Dataset
  • (optionally) get_serialized_info(): If the info returned by get_tensor_info() differs from how the data are actually written on disk, override get_serialized_info() to match the specs of the tf.train.Example
  1. If your connector only contains one value, then the get_tensor_info, encode_example, and decode_example methods can directly return a single value (without wrapping it in a dict).

  2. If your connector is a container of multiple sub-features, the easiest way is to inherit from tfds.features.FeaturesDict and use the super() methods to automatically encode/decode the sub-connectors.
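
For example, a minimal single-value connector might look like the following sketch (BoundingBox is hypothetical and assumes the box is stored as 4 floats):

import numpy as np

class BoundingBox(tfds.features.FeatureConnector):

  def get_tensor_info(self):
    # A single tensor of 4 floats as seen in the tf.data.Dataset.
    return tfds.features.TensorInfo(shape=(4,), dtype=tf.float32)

  def encode_example(self, example_data):
    # example_data comes straight from _generate_examples,
    # e.g. [0.1, 0.2, 0.7, 0.9].
    return np.asarray(example_data, dtype=np.float32)

  def decode_example(self, tfexample_data):
    # The serialized value already matches get_tensor_info, so pass it through.
    return tfexample_data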

Have a look at tfds.features.FeatureConnector for more details and the features package for more examples.

Adding the dataset to tensorflow/datasets

If you'd like to share your work with the community, you can check in your dataset implementation to tensorflow/datasets. Thanks for thinking of contributing!

Before you send your pull request, follow these last few steps:

1. Add an import for registration

All subclasses of tfds.core.DatasetBuilder are automatically registered when their module is imported such that they can be accessed through tfds.builder and tfds.load.

If you're contributing the dataset to tensorflow/datasets, add the module import to its subdirectory's __init__.py (e.g. image/__init__.py).
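
For example, for a hypothetical image dataset the added line would look something like:

# In tensorflow_datasets/image/__init__.py
from tensorflow_datasets.image.my_dataset import MyDataset

Once the module is imported, the dataset can be accessed by name, for example through tfds.builder("my_dataset") or tfds.load("my_dataset").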

2. Run download_and_prepare locally

Run download_and_prepare locally to ensure that data generation works:

# default data_dir is ~/tensorflow_datasets
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=my_new_dataset

Copy the contents of the dataset_info.json file(s) into a GitHub gist and link to it in your pull request.

3. Double-check the citation

It's important that DatasetInfo.citation includes a good citation for the dataset. It's hard and important work contributing a dataset to the community and we want to make it easy for dataset users to cite the work.

If the dataset's website has a specifically requested citation, use that (in BibTeX format).

If the paper is on arXiv, find it there and click the bibtex link on the right-hand side.

If the paper is not on arXiv, find the paper on Google Scholar and click the double-quotation mark underneath the title and on the popup, click BibTeX.

If there is no associated paper (for example, there's just a website), you can use the BibTeX Online Editor to create a custom BibTeX entry (the drop-down menu has an Online entry type).

4. Add a test

Most datasets in TFDS should have a unit test and your reviewer may ask you to add one if you haven't already. See the testing section below.

Large datasets and distributed generation

Some datasets are so large as to require multiple machines to download and generate. We intend to soon support this use case using Apache Beam. Follow our tracking issue to be updated.

Testing MyDataset

tfds.testing.DatasetBuilderTestCase is a base TestCase to fully exercise a dataset. It uses "fake examples" as test data that mimic the structure of the source dataset.

The test data should be put in testing/test_data/fake_examples/ under the my_dataset directory and should mimic the source dataset artifacts as downloaded and extracted. It can be created manually or automatically with a script (example script).

Make sure to use different data in your test data splits, as the test will fail if your dataset splits overlap.

The test data should not contain any copyrighted material. If in doubt, do not create the data using material from the original dataset.

import tensorflow as tf
from tensorflow_datasets import my_dataset
import tensorflow_datasets.testing as tfds_test


class MyDatasetTest(tfds_test.DatasetBuilderTestCase):
  DATASET_CLASS = my_dataset.MyDataset
  SPLITS = {  # Expected number of examples in each split, based on the fake examples.
      "train": 12,
      "test": 12,
  }
  # If dataset `download_and_extract`s more than one resource:
  DL_EXTRACT_RESULT = {
      "name1": "path/to/file1",  # Relative to fake_examples/my_dataset dir.
      "name2": "file2",
  }

if __name__ == "__main__":
  tfds_test.test_main()

You can run the test as you proceed to implement MyDataset. If you go through all the steps above, it should pass.