
Common implementation gotchas

This page describes common implementation gotchas when implementing a new dataset.

Legacy SplitGenerator should be avoided

The old tfds.core.SplitGenerator API is deprecated.

def _split_generator(...):
  return [
      tfds.core.SplitGenerator(name='train', gen_kwargs={'path': train_path}),
      tfds.core.SplitGenerator(name='test', gen_kwargs={'path': test_path}),
  ]

Should be replaced by:

def _split_generator(...):
  return {
      'train': self._generate_examples(path=train_path),
      'test': self._generate_examples(path=test_path),
  }

Rationale: The new API is less verbose and more explicit. The old API will be removed in a future version.
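As a minimal, self-contained sketch of the dict-based pattern (the file names and example contents here are hypothetical; a real builder's _generate_examples yields (key, example) pairs the same way):

```python
def _generate_examples(path):
  # Toy stand-in for a real builder method: yields (key, example) pairs.
  for i in range(2):
    yield f'{path}-{i}', {'text': f'example {i} from {path}'}

def _split_generators():
  # New-style API: a plain dict mapping split names to example generators.
  return {
      'train': _generate_examples(path='train.txt'),
      'test': _generate_examples(path='test.txt'),
  }

splits = _split_generators()
train_examples = dict(splits['train'])
```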

New datasets should be self-contained in a folder

When adding a dataset inside the tensorflow_datasets/ repository, please make sure to follow the dataset-as-folder structure (all checksums, dummy data, and implementation code self-contained in one folder).

  • Old datasets (bad): <category>/<ds_name>.py
  • New datasets (good): <category>/<ds_name>/<ds_name>.py

Use the TFDS CLI (tfds new, or gtfds new for googlers) to generate the template.

Rationale: The old structure required absolute paths for checksums and fake data, and scattered the dataset files across many places, making it harder to implement datasets outside the TFDS repository. For consistency, the new structure should now be used everywhere.
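For instance, a self-contained dataset folder generally looks like the tree below (the exact set of files generated by the CLI depends on the TFDS version; this layout is illustrative):

```
my_dataset/              # i.e. <category>/my_dataset/
    __init__.py
    my_dataset.py        # Dataset implementation
    my_dataset_test.py   # Tests
    dummy_data/          # Fake data for testing
    checksums.tsv        # Download URL checksums
```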

Description lists should be formatted as markdown

The DatasetInfo.description str is formatted as markdown. Markdown lists require an empty line before the first item:

_DESCRIPTION = """
Some text.
                      # << Empty line here !!!

1. Item 1
2. Item 1
3. Item 1
                      # << Empty line here !!!
Some other text.
"""

Rationale: Badly formatted descriptions create visual artifacts in our catalog documentation. Without the empty lines, the text above would be rendered as:

Some text. 1. Item 1 2. Item 1 3. Item 1 Some other text

Forgot ClassLabel names

When using tfds.features.ClassLabel, try to provide the human-readable label strings with names= or names_file= (instead of num_classes=10).

features = {
    'label': tfds.features.ClassLabel(names=['dog', 'cat', ...]),
}

Rationale: Human-readable labels are used in many places: they can be yielded directly in _generate_examples (e.g. yield {'label': 'dog'}), they are exposed to users as info.features['label'].names (with .str2int()/.int2str() conversion helpers), and they are used by visualization utilities such as tfds.show_examples.
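A toy sketch of the int ↔ str mapping that names= gives you (in real code this mapping is provided by ClassLabel's .int2str()/.str2int() methods; the names here are hypothetical):

```python
# Toy stand-in for the mapping that ClassLabel(names=...) maintains internally.
names = ['dog', 'cat', 'bird']
_name_to_id = {name: i for i, name in enumerate(names)}

def str2int(name):
  # Look up the integer id of a human-readable label.
  return _name_to_id[name]

def int2str(label_id):
  # Look up the human-readable label of an integer id.
  return names[label_id]
```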

Forgot image shape

When using tfds.features.Image or tfds.features.Video, if the images have a static shape, it should be explicitly specified:

features = {
    'image': tfds.features.Image(shape=(256, 256, 3)),
}

Rationale: It allows static shape inference (e.g. ds.element_spec['image'].shape), which is required for batching (batching images of unknown shape would require resizing them first).

Prefer more specific type instead of tfds.features.Tensor

When possible, prefer the more specific types tfds.features.ClassLabel, tfds.features.BBoxFeature, ... over the generic tfds.features.Tensor.

Rationale: In addition to being more semantically correct, specific features provide additional metadata to users and are detected by tools.

Lazy imports in global space

Lazy imports should not be called from the global space. For example the following is wrong:

beam = tfds.lazy_imports.apache_beam  # << Error: imports beam in the global scope

def f() -> beam.Map:
  ...

Rationale: Using lazy imports in the global scope would import the module for all tfds users, defeating the purpose of lazy imports.
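A runnable sketch of the corrected pattern, using a toy stand-in for tfds.lazy_imports (the real fix has the same shape: resolve the lazy import inside the function body):

```python
import importlib

class _LazyImports:
  """Toy stand-in for tfds.lazy_imports: resolves a module on attribute access."""
  def __getattr__(self, name):
    return importlib.import_module(name)

lazy_imports = _LazyImports()

def f():
  # Correct: the lazy import is resolved inside the function body, so the
  # module is only loaded for users who actually call f().
  json = lazy_imports.json  # analogous to: beam = tfds.lazy_imports.apache_beam
  return json.dumps({'ok': True})
```

In a real dataset, also quote any type annotation that mentions the lazily imported module (e.g. def f() -> 'beam.Map':) so the annotation is not evaluated at import time.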

Dynamically computing train/test splits

If the dataset does not provide official splits, neither should TFDS. The following should be avoided:

_TRAIN_TEST_RATIO = 0.7

def _split_generator():
  ids = list(range(num_examples))
  np.random.RandomState(seed).shuffle(ids)

  # Split train/test
  num_train = int(_TRAIN_TEST_RATIO * num_examples)
  train_ids = ids[:num_train]
  test_ids = ids[num_train:]
  return {
      'train': self._generate_examples(train_ids),
      'test': self._generate_examples(test_ids),
  }

Rationale: TFDS tries to provide datasets as close as possible to the original data. The sub-split API should be used instead, letting users dynamically create the sub-splits they want:

ds_train, ds_test = tfds.load(..., split=['train[:80%]', 'train[80%:]'])
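As a rough, pure-Python illustration of how such percent slices partition the examples (the boundary rounding here is simplified; TFDS defines its own exact semantics):

```python
def pct_slice(num_examples, lo_pct, hi_pct):
  # Map a 'split[lo%:hi%]'-style slice onto example index bounds.
  lo = num_examples * lo_pct // 100
  hi = num_examples * hi_pct // 100
  return lo, hi

# A dynamic 80/20 split of a 1000-example 'train' split:
train_bounds = pct_slice(1000, 0, 80)    # first 80% of the examples
test_bounds = pct_slice(1000, 80, 100)   # remaining 20%
```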