Add a new dataset collection

Follow this guide to create a new dataset collection (either in TFDS or in your own repository).


To add a new dataset collection my_collection to TFDS, users need to generate a my_collection folder containing the following files:

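my_collection/
  • my_collection.py: the dataset collection definition.
  • (Optional) a test file.
  • (Optional) a collection description file (if the description is not included in my_collection.py).
  • (Optional) citations.bib: the collection citations (if not included in my_collection.py).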

As a convention, new dataset collections should be added to the tensorflow_datasets/dataset_collections/ folder in the TFDS repository.

Write your dataset collection

All dataset collections are implemented as subclasses of tfds.core.dataset_collection_builder.DatasetCollection.

Here is a minimal example of a dataset collection builder, defined in the file my_collection.py:

import collections
from typing import Mapping
from tensorflow_datasets.core import dataset_collection_builder
from tensorflow_datasets.core import naming

class MyCollection(dataset_collection_builder.DatasetCollection):
  """Dataset collection builder my_dataset_collection."""

  @property
  def info(self) -> dataset_collection_builder.DatasetCollectionInfo:
    return dataset_collection_builder.DatasetCollectionInfo.from_cls(
        dataset_collection_class=self.__class__,
        description="my_dataset_collection description.",
        release_notes={
            "1.0.0": "Initial release",
        },
    )

  @property
  def datasets(
      self,
  ) -> Mapping[str, Mapping[str, naming.DatasetReference]]:
    # Maps each collection version to the datasets it contains.
    # The version keys below are illustrative.
    return collections.OrderedDict({
        "1.0.0": naming.references_for({
            "dataset_1": "natural_questions/default:0.0.2",
            "dataset_2": "media_sum:1.0.0",
        }),
        "1.1.0": naming.references_for({
            "dataset_1": "natural_questions/longt5:0.1.0",
            "dataset_2": "media_sum:1.0.0",
            "dataset_3": "squad:3.0.0",
        }),
    })

The next sections describe the two abstract methods to override.

info: dataset collection metadata

The info method returns the dataset_collection_builder.DatasetCollectionInfo containing the collection's metadata.

The dataset collection info contains four fields:

  • name: the name of the dataset collection.
  • description: a markdown-formatted description of the dataset collection. There are two ways to define it: (1) as a (multi-line) string directly in the collection's file, as is already done for TFDS datasets; (2) in a separate description file, which must be placed in the dataset collection folder.
  • release_notes: a mapping from the dataset collection's version to the corresponding release notes.
  • citation: an optional BibTeX citation (or list of citations) for the dataset collection. There are two ways to define it: (1) as a (multi-line) string directly in the collection's file, as is already done for TFDS datasets; (2) in a citations.bib file, which must be placed in the dataset collection folder.
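The "inline string or file in the folder" choice described for description and citation can be sketched with the standard library alone. This is an illustration, not TFDS code: CollectionInfoSketch and inline_or_file are hypothetical names, the file name description.txt is a placeholder, and the rule that an inline string takes precedence over a file is an assumption made for this sketch.

```python
import dataclasses
import pathlib
import tempfile
from typing import Mapping, Optional


@dataclasses.dataclass(frozen=True)
class CollectionInfoSketch:
  """Hypothetical stand-in mirroring the four fields listed above."""
  name: str
  description: Optional[str]
  release_notes: Mapping[str, str]
  citation: Optional[str] = None


def inline_or_file(inline: Optional[str], path: pathlib.Path) -> Optional[str]:
  """Returns the inline text if given, else the file's content if it exists."""
  if inline is not None:
    return inline  # Assumption: an inline string wins over a file.
  return path.read_text(encoding="utf-8") if path.exists() else None


# Demo in a throwaway folder: the citation is read from citations.bib,
# while the description is passed inline.
with tempfile.TemporaryDirectory() as tmp:
  folder = pathlib.Path(tmp)
  (folder / "citations.bib").write_text(
      "@misc{example, title={Example}}", encoding="utf-8")
  info = CollectionInfoSketch(
      name="my_collection",
      # "description.txt" is a placeholder name used only in this sketch.
      description=inline_or_file(
          "An inline description.", folder / "description.txt"),
      release_notes={"1.0.0": "Initial release"},
      citation=inline_or_file(None, folder / "citations.bib"),
  )
```

In the real builder, DatasetCollectionInfo plays the role of the stand-in dataclass; how it locates and loads the files is not shown here.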

datasets: define the datasets in the collection

The datasets method returns the TFDS datasets in the collection.

It is defined as a dictionary keyed by collection version, which records how the dataset collection evolves over time.
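Because every version carries its own complete mapping, the evolution of a collection can be read off by comparing two versions. The sketch below uses only the standard library and plain strings in place of dataset references; the version numbers, dataset names, and the diff_versions helper are all hypothetical.

```python
from typing import Dict, List, Mapping

# Hypothetical history: collection version -> {name in collection -> reference}.
VERSIONS: Mapping[str, Mapping[str, str]] = {
    "1.0.0": {
        "qa": "some_dataset/default:0.0.2",
        "summ": "other_dataset:1.0.0",
    },
    "1.1.0": {
        "qa": "some_dataset/default:0.1.0",
        "summ": "other_dataset:1.0.0",
        "reading": "third_dataset:3.0.0",
    },
}


def diff_versions(
    old: Mapping[str, str], new: Mapping[str, str]
) -> Dict[str, List[str]]:
  """Summarizes what changed between two versions of a collection."""
  return {
      "added": sorted(new.keys() - old.keys()),
      "removed": sorted(old.keys() - new.keys()),
      "changed": sorted(
          k for k in old.keys() & new.keys() if old[k] != new[k]),
  }


changes = diff_versions(VERSIONS["1.0.0"], VERSIONS["1.1.0"])
print(changes)
# {'added': ['reading'], 'removed': [], 'changed': ['qa']}
```

Keeping a full mapping per version, rather than incremental patches, means any released version of the collection can be resolved without replaying earlier ones.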

For each version, the included TFDS datasets are stored as a dictionary from dataset names to naming.DatasetReference. For example:

class MyCollection(dataset_collection_builder.DatasetCollection):
  @property
  def datasets(self):
    return {
        "1.0.0": {
            "yes_no": naming.DatasetReference(
                dataset_name="yes_no", version="1.0.0"),
            "sst2": naming.DatasetReference(
                dataset_name="glue", config="sst2", version="2.0.0"),
            "assin2": naming.DatasetReference(
                dataset_name="assin2", version="1.0.0"),
        },
    }

The naming.references_for function provides a more compact way to express the same mapping, using strings of the form dataset_name/config:version:

class MyCollection(dataset_collection_builder.DatasetCollection):
  @property
  def datasets(self):
    return {
        "1.0.0": naming.references_for({
            "yes_no": "yes_no:1.0.0",
            "sst2": "glue/sst2:2.0.0",
            "assin2": "assin2:1.0.0",
        }),
    }

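To make the compact string form concrete, here is a minimal standard-library sketch of how a string such as "glue/sst2:2.0.0" breaks down into the dataset_name, config, and version fields used above. It is not the TFDS parser: ParsedReference and parse_reference are hypothetical names, and the sketch assumes only the name[/config][:version] shape, ignoring any other forms TFDS may accept.

```python
from typing import NamedTuple, Optional


class ParsedReference(NamedTuple):
  """Hypothetical container for the three parts of a reference string."""
  dataset_name: str
  config: Optional[str]
  version: Optional[str]


def parse_reference(text: str) -> ParsedReference:
  """Splits "name[/config][:version]" into its components."""
  name_and_config, _, version = text.partition(":")
  dataset_name, _, config = name_and_config.partition("/")
  if not dataset_name:
    raise ValueError(f"Missing dataset name in {text!r}")
  # Empty parts (no config or no version given) are reported as None.
  return ParsedReference(dataset_name, config or None, version or None)


with_config = parse_reference("glue/sst2:2.0.0")
without_config = parse_reference("yes_no:1.0.0")
print(with_config)
print(without_config)
```

Here with_config holds dataset_name "glue", config "sst2", and version "2.0.0", while without_config has its config set to None.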
Unit-test your dataset collection

DatasetCollectionTestBase is a base test class for dataset collections. It provides a number of simple checks to verify that the dataset collection is correctly registered and that its datasets exist in TFDS.

The only class attribute that must be set is DATASET_COLLECTION_CLASS, which specifies the class object of the dataset collection to test.

Additionally, users can set the following class attributes:

  • VERSION: The version of the dataset collection used to run the test (defaults to the latest version).
  • DATASETS_TO_TEST: List of the datasets whose existence in TFDS is checked (defaults to all datasets in the collection).
  • CHECK_DATASETS_VERSION: Whether to check for the existence of the versioned datasets in the dataset collection, or for their default versions (defaults to True).
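How the latest version is determined is not described here. As a rough sketch, assuming versions follow a numeric major.minor.patch pattern, the latest one can be picked by comparing the parts as integers rather than as strings. The helper names version_key and latest_version are hypothetical.

```python
from typing import Iterable, Tuple


def version_key(version: str) -> Tuple[int, ...]:
  """Turns "1.10.0" into (1, 10, 0) so versions compare numerically."""
  return tuple(int(part) for part in version.split("."))


def latest_version(versions: Iterable[str]) -> str:
  """Returns the highest version under numeric comparison."""
  return max(versions, key=version_key)


print(latest_version(["1.0.0", "1.10.0", "1.2.0"]))  # prints 1.10.0
```

A plain string comparison would rank "1.2.0" above "1.10.0", which is why the sketch converts each part to an integer first.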

The simplest valid test for a dataset collection would be:

from tensorflow_datasets.testing.dataset_collection_builder_testing import DatasetCollectionTestBase
from . import my_collection

class TestMyCollection(DatasetCollectionTestBase):
  DATASET_COLLECTION_CLASS = my_collection.MyCollection

Run the following command to test the dataset collection.



We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating the dataset collection? Was there a part which was confusing, or wasn't working the first time?

Please share your feedback on GitHub.