
tfds.core.DatasetBuilder

Class DatasetBuilder

Defined in core/dataset_builder.py.

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

  • tfds.DatasetBuilder.info: documents the dataset, including feature names, types, and shapes, as well as its version, splits, and citation.
  • tfds.DatasetBuilder.download_and_prepare: downloads the source data and writes it to disk.
  • tfds.DatasetBuilder.as_dataset: builds an input pipeline using tf.data.Datasets.

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a tfds.core.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in tfds.DatasetBuilder.builder_configs.
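
As a hedged illustration (the "imdb_reviews" dataset and its "subwords8k" config are used here as examples; the available config names may differ across releases), a configuration can be inspected and selected by name:

import tensorflow_datasets as tfds

# Inspect the pre-defined configurations registered for a builder.
# Config names below are illustrative; check builder_configs for your version.
print(tfds.builder("imdb_reviews").builder_configs.keys())

# Select a configuration by name at construction time...
imdb_builder = tfds.builder("imdb_reviews", config="subwords8k")
# ...or, in recent versions, with the "dataset/config" shorthand.
imdb_builder = tfds.builder("imdb_reviews/subwords8k")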

Typical DatasetBuilder usage:

import tensorflow as tf
import tensorflow_datasets as tfds

# Fetch the builder by name, inspect its metadata, then download and
# prepare the data on disk.
mnist_builder = tfds.builder("mnist")
mnist_info = mnist_builder.info
mnist_builder.download_and_prepare()

# Build tf.data.Datasets for all splits.
datasets = mnist_builder.as_dataset()
train_dataset, test_dataset = datasets["train"], datasets["test"]
assert isinstance(train_dataset, tf.data.Dataset)

# And then the rest of your input pipeline
train_dataset = train_dataset.repeat().shuffle(1024).batch(128)
train_dataset = train_dataset.prefetch(2)
features = tf.compat.v1.data.make_one_shot_iterator(train_dataset).get_next()
image, label = features["image"], features["label"]

__init__

__init__(
    data_dir=None,
    config=None
)

Constructs a DatasetBuilder.

Callers must pass arguments as keyword arguments.

Args:

  • data_dir: str, directory to read/write data. Defaults to "~/tensorflow_datasets".
  • config: tfds.core.BuilderConfig or str name, optional configuration for the dataset that affects the data generated on disk. Different builder_configs will have their own subdirectories and versions.
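
For illustration, both arguments are typically forwarded through tfds.builder rather than passed to the class directly; the path below is arbitrary:

import tensorflow_datasets as tfds

# Keep the prepared dataset under a custom directory instead of the
# default "~/tensorflow_datasets".
mnist_builder = tfds.builder("mnist", data_dir="/tmp/my_tfds_data")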

Properties

builder_config

tfds.core.BuilderConfig for this builder.

info

tfds.core.DatasetInfo for this builder.

Methods

as_dataset

as_dataset(
    split=None,
    batch_size=1,
    shuffle_files=None,
    as_supervised=False
)

Constructs a tf.data.Dataset.

Callers must pass arguments as keyword arguments.

Args:

  • split: tfds.core.SplitBase, which subset(s) of the data to read. If None (default), returns all splits in a dict <key: tfds.Split, value: tf.data.Dataset>.
  • batch_size: int, batch size. Note that variable-length features will be 0-padded if batch_size > 1. Users that want more custom behavior should use batch_size=1 and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
  • shuffle_files: bool, whether to shuffle the input files. Defaults to True if split == tfds.Split.TRAIN and False otherwise.
  • as_supervised: bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.

Returns:

tf.data.Dataset, or if split=None, a dict <key: tfds.Split, value: tf.data.Dataset>.

If batch_size is -1, will return feature dictionaries containing the entire dataset in tf.Tensors instead of a tf.data.Dataset.
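
A brief sketch of the common call patterns, reusing mnist_builder, tf, and tfds from the usage example above (the split choices are examples, not defaults):

# Single split as (input, label) tuples, with file shuffling enabled.
train_ds = mnist_builder.as_dataset(
    split=tfds.Split.TRAIN,
    as_supervised=True,
    shuffle_files=True)
assert isinstance(train_ds, tf.data.Dataset)

# batch_size=-1 returns the whole split as a dict of tf.Tensors rather
# than a tf.data.Dataset.
full_test = mnist_builder.as_dataset(split=tfds.Split.TEST, batch_size=-1)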

download_and_prepare

download_and_prepare(
    download_dir=None,
    download_config=None
)

Downloads and prepares dataset for reading.

Args:

  • download_dir: str, directory where downloaded files are stored. Defaults to "~/tensorflow_datasets/downloads".
  • download_config: tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.
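
A minimal sketch of an explicit call, reusing mnist_builder from the usage example above (the download path is arbitrary, and a default tfds.download.DownloadConfig is passed):

# Download into a custom directory with default preparation settings.
mnist_builder.download_and_prepare(
    download_dir="/tmp/tfds_downloads",
    download_config=tfds.download.DownloadConfig())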

Class Members

BUILDER_CONFIGS

IN_DEVELOPMENT

VERSION

builder_configs

name