Base class for datasets with data generation based on file adapter.
Inherits From: DatasetBuilder
tfds.core.GeneratorBasedBuilder(
    *,
    file_format: Union[None, str, file_adapters.FileFormat] = file_adapters.DEFAULT_FILE_FORMAT,
    **kwargs
)
GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to overwrite _split_generators to return a dict mapping split names to generators. See the method docstrings for details.
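For orientation, here is a minimal sketch of such a subclass; the dataset name, features, download URL and file layout are illustrative assumptions, not part of the API:
```python
import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
  """Illustrative builder; adapt features, URL and file layout to your data."""

  VERSION = tfds.core.Version('1.0.0')

  def _info(self) -> tfds.core.DatasetInfo:
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Image(),
            'label': tfds.features.ClassLabel(names=['no', 'yes']),
        }),
    )

  def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    # Hypothetical source archive; replace with your dataset's URL.
    path = dl_manager.download_and_extract('https://example.org/my_data.zip')
    return {
        'train': self._generate_examples(path / 'train'),
        'test': self._generate_examples(path / 'test'),
    }

  def _generate_examples(self, path):
    # Yields (key, example) pairs; the key must be unique and deterministic.
    for img_path in path.glob('*.jpeg'):
      yield img_path.name, {'image': img_path, 'label': 'yes'}
```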
Args | |
---|---|
file_format | EXPERIMENTAL, may change at any time; Format of the record files in which the dataset will be read/written. Defaults to tfrecord.
**kwargs | Arguments passed to DatasetBuilder.
Attributes | |
---|---|
builder_config | tfds.core.BuilderConfig for this builder.
canonical_version | 
data_dir | 
data_path | 
info | tfds.core.DatasetInfo for this builder.
release_notes | 
supported_versions | 
version | 
versions | Versions (canonical + available), in preference order.
Methods
as_dataset
as_dataset(
    split=None, *, batch_size=None, shuffle_files=False, decoders=None,
    read_config=None, as_supervised=False
)
Constructs a tf.data.Dataset.
Callers must pass arguments as keyword arguments.
The output types vary depending on the parameters. Examples:
builder = tfds.builder('imdb_reviews')
builder.download_and_prepare()
# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys()) # ==> ['test', 'train', 'unsupervised']
assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
# 'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
# 'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}
# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys()) # ==> ['test', 'train', 'unsupervised']
assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
Args | |
---|---|
split | Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]', ...). See our split API guide. If None, will return all splits in a Dict[Split, tf.data.Dataset].
batch_size | int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users that want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
shuffle_files | bool, whether to shuffle the input files. Defaults to False.
decoders | Nested dict of Decoder objects which allow customizing the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
read_config | tfds.ReadConfig, additional options to configure the input pipeline (e.g. seed, num parallel reads, ...).
as_supervised | bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.
Returns | |
---|---|
tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
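For example, a hedged sketch of the decoders argument (it assumes a dataset with an 'image' feature; the builder name is illustrative):
```python
builder = tfds.builder('mnist')  # illustrative; any dataset with an 'image' feature
builder.download_and_prepare()
# Skip image decoding so the raw encoded bytes are returned.
ds = builder.as_dataset(
    split='train',
    decoders={'image': tfds.decode.SkipDecoding()},
)
```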
download_and_prepare
download_and_prepare(
    *, download_dir=None, download_config=None
)
Downloads and prepares dataset for reading.
Args | |
---|---|
download_dir | str, directory where downloaded files are stored. Defaults to "~/tensorflow-datasets/downloads".
download_config | tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.
Raises | |
---|---|
IOError | if there is not enough disk space available.
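A small usage sketch (the dataset name and download directory are illustrative):
```python
builder = tfds.builder('mnist')
builder.download_and_prepare(
    download_dir='/tmp/tfds_downloads',
    download_config=tfds.download.DownloadConfig(),
)
ds = builder.as_dataset(split='train')
```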
_generate_examples
@abc.abstractmethod
_generate_examples( **kwargs ) -> split_builder_lib.SplitGenerator
Default function to generate examples for each split.
The function should return a collection of (key, example) pairs. Examples will be encoded and written to disk. See the yields section for details.
The function can return/yield:
- A Python generator:
def _generate_examples(self, path):
  for filepath in path.iterdir():
    yield filepath.name, {'image': ..., 'label': ...}
- A beam.PTransform of (input_types: [] -> output_types: KeyExample): for big datasets and distributed generation. See our Apache Beam datasets guide for more info.
def _generate_examples(self, path):
  return (
      beam.Create(path.iterdir())
      | beam.Map(lambda filepath: (filepath.name, {'image': ..., 'label': ...}))
  )
- A beam.PCollection: this should only be used if you need to share some distributed processing across splits. In this case, you can use the following pattern:
def _split_generators(self, dl_manager, pipeline):
  ...
  # Distributed processing shared across splits
  pipeline |= beam.Create(path.iterdir())
  pipeline |= 'SharedPreprocessing' >> beam.Map(_common_processing)
  ...
  # Wrap the pipeline inside a ptransform_fn so each split can prefix it with a
  # `'label' >> ` name and avoid duplicated PTransform node names.
  generate_examples = beam.ptransform_fn(self._generate_examples)
  return {
      'train': pipeline | 'train' >> generate_examples(is_train=True),
      'test': pipeline | 'test' >> generate_examples(is_train=False),
  }

def _generate_examples(self, pipeline, is_train: bool):
  return pipeline | beam.Map(_split_specific_processing, is_train=is_train)
Args | |
---|---|
**kwargs | Keyword arguments passed from _split_generators.

Yields | |
---|---|
key | str or int, a unique deterministic example identification key.
example | dict<str feature_name, feature_value>, a feature dictionary ready to be encoded and written to disk. The example will be encoded with self.info.features.encode_example({...}).
_info
@abc.abstractmethod
_info()
Returns the tfds.core.DatasetInfo object.
This function is called once and the result is cached for all following calls.
Returns | |
---|---|
dataset_info | The dataset metadata.
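A minimal sketch of an _info implementation; the description, features, supervised keys and homepage are illustrative:
```python
def _info(self) -> tfds.core.DatasetInfo:
  # Illustrative metadata; adapt to your dataset.
  return tfds.core.DatasetInfo(
      builder=self,
      description='Description of my dataset.',
      features=tfds.features.FeaturesDict({
          'image': tfds.features.Image(shape=(None, None, 3)),
          'label': tfds.features.ClassLabel(names=['no', 'yes']),
      }),
      supervised_keys=('image', 'label'),
      homepage='https://example.org/dataset',
  )
```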
_split_generators
@abc.abstractmethod
_split_generators(
    dl_manager: tfds.download.DownloadManager
) -> Dict[splits_lib.Split, split_builder_lib.SplitGenerator]
Downloads the data and returns dataset splits with associated examples.
Example:
def _split_generators(self, dl_manager):
  path = dl_manager.download_and_extract('http://dataset.org/my_data.zip')
  return {
      'train': self._generate_examples(path=path / 'train_imgs'),
      'test': self._generate_examples(path=path / 'test_imgs'),
  }
- If the original dataset does not have predefined train, test, ... splits, this function should only return a single train split here. Users can use the subsplit API to create subsplits (e.g. tfds.load(..., split=['train[:75%]', 'train[75%:]'])).
- tfds.download.DownloadManager caches downloads, so calling download on the same url multiple times only downloads it once.
- A good practice is to download all data in this function, and to have all the computation inside _generate_examples.
- Splits are generated in the order defined here. builder.info.splits keeps the same order.
- This function can have an extra pipeline kwarg only if some beam preprocessing should be shared across splits. In this case, a dict of beam.PCollection should be returned. See _generate_examples for details.
Args | |
---|---|
dl_manager | tfds.download.DownloadManager used to download/extract the data.
Returns | |
---|---|
The dict mapping split names to generators. See _generate_examples for details about the generator format.
Class Variables | |
---|---|
BUILDER_CONFIGS | 
MANUAL_DOWNLOAD_INSTRUCTIONS | None
RELEASE_NOTES | 
SUPPORTED_VERSIONS | 
VERSION | None
builder_configs | 
code_path | 
name | 'generator_based_builder'
url_infos | None
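For illustration, a hedged sketch of how a subclass might set these class variables (all values are made up; the abstract methods are omitted):
```python
class MyDataset(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('1.1.0')
  SUPPORTED_VERSIONS = [tfds.core.Version('1.0.0')]
  RELEASE_NOTES = {
      '1.1.0': 'Added the test split.',
      '1.0.0': 'Initial release.',
  }
  MANUAL_DOWNLOAD_INSTRUCTIONS = 'Place the archive in the manual_dir.'
  BUILDER_CONFIGS = [
      tfds.core.BuilderConfig(name='default', description='Default config.'),
  ]
  # _info, _split_generators and _generate_examples omitted for brevity.
```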