TFDS now supports the Croissant 🥐 format! Read the documentation to know more.

tfds.core.DatasetBuilder

Abstract base class for all datasets.

tfds.core.DatasetBuilder(
    *,
    data_dir: Optional[epath.PathLike] = None,
    config: Union[None, str, BuilderConfig] = None,
    version: Union[None, str, utils.Version] = None
)

DatasetBuilder has 3 key methods:

DatasetBuilder.info: documents the dataset, including feature names, types, and shapes, version, splits, citation, etc.
DatasetBuilder.download_and_prepare: downloads the source data and writes it to disk.
DatasetBuilder.as_dataset: builds an input pipeline using tf.data.Datasets.

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a tfds.core.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in DatasetBuilder.builder_configs.

Typical DatasetBuilder usage:

mnist_builder = tfds.builder("mnist")
mnist_info = mnist_builder.info
mnist_builder.download_and_prepare()
datasets = mnist_builder.as_dataset()

train_dataset, test_dataset = datasets["train"], datasets["test"]
assert isinstance(train_dataset, tf.data.Dataset)

# And then the rest of your input pipeline
train_dataset = train_dataset.repeat().shuffle(1024).batch(128)
train_dataset = train_dataset.prefetch(2)
features = tf.compat.v1.data.make_one_shot_iterator(train_dataset).get_next()
image, label = features['image'], features['label']

Args
`data_dir`	directory to read/write data. Defaults to the value of the environment variable TFDS_DATA_DIR, if set, otherwise falls back to datasets are stored.
`config`	`tfds.core.BuilderConfig` or `str` name, optional configuration for the dataset that affects the data generated on disk. Different `builder_config`s will have their own subdirectories and versions.
`version`	Optional version at which to load the dataset. An error is raised if specified version cannot be satisfied. Eg: '1.2.3', '1.2.*'. The special value "experimental_latest" will use the highest version, even if not default. This is not recommended unless you know what you are doing, as the version could be broken.

Attributes
`builder_config`	`tfds.core.BuilderConfig` for this builder.
`canonical_version`
`data_dir`	Returns the directory where this version + config is stored. Note that this is different from `data_dir_root`. For example, if `data_dir_root` is `/data/tfds`, then `data_dir` would be `/data/tfds/my_dataset/my_config/1.2.3`.
`data_dir_root`	Returns the root directory where all TFDS datasets are stored. Note that this is different from `data_dir`, which includes the dataset name, config, and version. For example, if `data_dir` is `/data/tfds/my_dataset/my_config/1.2.3`, then `data_dir_root` is `/data/tfds`.
`data_path`	Returns the path where this version + config is stored.
`info`	`tfds.core.DatasetInfo` for this builder.
`release_notes`
`supported_versions`
`version`
`versions`	Versions (canonical + availables), in preference order.

Methods

`as_data_source`

View source

as_data_source(
    split: Optional[Tree[splits_lib.SplitArg]] = None,
    *,
    decoders: Optional[TreeDict[decode.partial_decode.DecoderArg]] = None
) -> ListOrTreeOrElem[Sequence[Any]]

Constructs an ArrayRecordDataSource.

Args
`split`	Which split of the data to load (e.g. `'train'`, `'test'`, `['train', 'test']`, `'train[80%:]'`,...). See our split API guide. If `None`, will return all splits in a `Dict[Split, Sequence]`.
`decoders`	Nested dict of `Decoder` objects which allow to customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.

Returns
`Sequence` if `split`, `dict<key: tfds.Split, value: Sequence>` otherwise.

Raises
NotImplementedError if the data was not generated using ArrayRecords.

`as_dataset`

View source

as_dataset(
    split: Optional[Tree[splits_lib.SplitArg]] = None,
    *,
    batch_size: Optional[int] = None,
    shuffle_files: bool = False,
    decoders: Optional[TreeDict[decode.partial_decode.DecoderArg]] = None,
    read_config: Optional[read_config_lib.ReadConfig] = None,
    as_supervised: bool = False
)

Constructs a tf.data.Dataset.

Callers must pass arguments as keyword arguments.

The output types vary depending on the parameters. Examples:

builder = tfds.builder('imdb_reviews')
builder.download_and_prepare()

# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}

# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

Args
`split`	Which split of the data to load (e.g. `'train'`, `'test'`, `['train', 'test']`, `'train[80%:]'`,...). See our split API guide. If `None`, will return all splits in a `Dict[Split, tf.data.Dataset]`.
`batch_size`	`int`, batch size. Note that variable-length features will be 0-padded if `batch_size` is set. Users that want more custom behavior should use `batch_size=None` and use the `tf.data` API to construct a custom pipeline. If `batch_size == -1`, will return feature dictionaries of the whole dataset with `tf.Tensor`s instead of a `tf.data.Dataset`.
`shuffle_files`	`bool`, whether to shuffle the input files. Defaults to `False`.
`decoders`	Nested dict of `Decoder` objects which allow to customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
`read_config`	`tfds.ReadConfig`, Additional options to configure the input pipeline (e.g. seed, num parallel reads,...).
`as_supervised`	`bool`, if `True`, the returned `tf.data.Dataset` will have a 2-tuple structure `(input, label)` according to `builder.info.supervised_keys`. If `False`, the default, the returned `tf.data.Dataset` will have a dictionary with all the features.

Returns

Returns
`tf.data.Dataset`, or if `split=None`, `dict<key: tfds.Split, value: tf.data.Dataset>`. If `batch_size` is -1, will return feature dictionaries containing the entire dataset in `tf.Tensor`s instead of a `tf.data.Dataset`.

tf.data.Dataset, or if split=None,

dict<key: tfds.Split, value:
tf.data.Dataset>

If batch_size is -1, will return feature dictionaries containing the entire dataset in tf.Tensors instead of a tf.data.Dataset.

`dataset_info_from_configs`

View source

dataset_info_from_configs(
    **kwargs
)

Returns the DatasetInfo using given kwargs and config files.

Sub-class should call this and add information not present in config files using kwargs directly passed to tfds.core.DatasetInfo object.

If information is present both in passed arguments and config files, config files will prevail.

Args
`**kwargs`	kw args to pass to DatasetInfo directly.

`download_and_prepare`

View source

download_and_prepare(
    *,
    download_dir: Optional[epath.PathLike] = None,
    download_config: Optional[download.DownloadConfig] = None,
    file_format: Optional[Union[str, file_adapters.FileFormat]] = None
) -> None

Downloads and prepares dataset for reading.

Args
`download_dir`	directory where downloaded files are stored. Defaults to "~/tensorflow-datasets/downloads".
`download_config`	`tfds.download.DownloadConfig`, further configuration for downloading and preparing dataset.
`file_format`	optional `str` or `file_adapters.FileFormat`, format of the record files in which the dataset will be written.

Raises
`IOError`	if there is not enough disk space available.
`RuntimeError`	when the config cannot be found.

`get_default_builder_config`

View source

get_default_builder_config() -> Optional[BuilderConfig]

Returns the default builder config if there is one.

Note that for dataset builders that cannot use the cls.BUILDER_CONFIGS, we need a method that uses the instance to get BUILDER_CONFIGS and DEFAULT_BUILDER_CONFIG_NAME.

Returns
the default builder config if there is one

`get_metadata`

View source

@classmethod
get_metadata() -> dataset_metadata.DatasetMetadata

Returns metadata (README, CITATIONS, ...) specified in config files.

The config files are read from the same package where the DatasetBuilder has been defined, so those metadata might be wrong for legacy builders.

`get_reference`

View source

get_reference(
    namespace: Optional[str] = None
) -> naming.DatasetReference

Returns a reference to the dataset produced by this dataset builder.

Includes the config if specified, the version, and the data_dir that should contain this dataset.

Arguments
`namespace`	if this dataset is a community dataset, and therefore has a namespace, then the namespace must be provided such that it can be set in the reference. Note that a dataset builder is not aware that it is part of a namespace.

Returns
a reference to this instantiated builder.

`is_prepared`

View source

is_prepared() -> bool

Returns whether this dataset is already downloaded and prepared.

Class Variables
BUILDER_CONFIGS	`[]`
DEFAULT_BUILDER_CONFIG_NAME	`None`
MANUAL_DOWNLOAD_INSTRUCTIONS	`None`
MAX_SIMULTANEOUS_DOWNLOADS	`None`
RELEASE_NOTES	`}`
SUPPORTED_VERSIONS	`[]`
VERSION	`None`
builder_config_cls	`None`
builder_configs	`}`
code_path	Instance of `etils.epath.gpath.PosixGPath`
default_builder_config	`None`
name	`'dataset_builder'`
pkg_dir_path	`None`
url_infos	`None`

tfds.core.DatasetBuilder

Args

Attributes

Methods

as_data_source

as_dataset

dataset_info_from_configs

download_and_prepare

get_default_builder_config

get_metadata

get_reference

is_prepared

Class Variables

`as_data_source`

`as_dataset`

`dataset_info_from_configs`

`download_and_prepare`

`get_default_builder_config`

`get_metadata`

`get_reference`

`is_prepared`