tfds.core.GeneratorBasedBuilder

Class GeneratorBasedBuilder

Base class for datasets with data generation based on dict generators.

GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators) and to specify a file type (_file_format_adapter). See the method docstrings for details.

FileFormatAdapters are defined in tensorflow_datasets.core.file_format_adapter and specify constraints on the feature dictionaries yielded by example generators. See the class docstrings.
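
The generator-based pattern described above can be illustrated without TensorFlow. The following is a toy sketch, not the tfds implementation: ToyBuilder, materialize, _generate_examples as written here, and the record contents are all hypothetical, and only the method name _split_generators comes from the text above. The real method signatures differ, so see the method docstrings before subclassing.

```python
from typing import Dict, Iterator


class ToyBuilder:
    """Hypothetical stand-in for a generator-based builder (not a tfds class)."""

    # Made-up in-memory "raw data"; a real builder would read downloaded files.
    _RAW = {
        "train": [("good movie", 1), ("bad movie", 0)],
        "test": [("fine movie", 1)],
    }

    def _split_generators(self) -> Dict[str, Iterator[dict]]:
        # Map each split name to a generator of feature dictionaries.
        return {name: self._generate_examples(name) for name in self._RAW}

    def _generate_examples(self, split: str) -> Iterator[dict]:
        # Yield one feature dictionary per example in the split.
        for text, label in self._RAW[split]:
            yield {"text": text, "label": label}


def materialize(builder: ToyBuilder) -> Dict[str, list]:
    """Consume every split generator, as a base class might before writing."""
    return {name: list(gen) for name, gen in builder._split_generators().items()}


splits = materialize(ToyBuilder())
print(sorted(splits))
print(splits["test"])
```

The point of the design is that the subclass only describes how to yield examples per split, while the writing and reading are left to the base class.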

__init__

__init__(
    data_dir=None,
    config=None,
    version=None
)

Constructs a DatasetBuilder.

Callers must pass arguments as keyword arguments.

Args:

  • data_dir: str, directory to read/write data. Defaults to the standard directory where datasets are stored ("~/tensorflow_datasets").
  • config: tfds.core.BuilderConfig or str name, optional configuration for the dataset that affects the data generated on disk. Different builder_configs will have their own subdirectories and versions.
  • version: str. Optional version at which to load the dataset. An error is raised if the specified version cannot be satisfied. E.g., '1.2.3', '1.2.*'. The special value "experimental_latest" will use the highest version, even if it is not the default. This is not recommended unless you know what you are doing, as the version could be broken.
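
To make the wildcard form of the version argument concrete, here is a minimal pure-Python sketch of how a pattern such as '1.2.*' could be matched against concrete versions. The helpers version_matches and pick_version are hypothetical and are not part of tfds. Choosing the highest match and raising ValueError are assumptions made for illustration; the library's actual resolution rules and error type may differ.

```python
def version_matches(version: str, pattern: str) -> bool:
    """Return True if `version` satisfies `pattern`; '*' matches any component."""
    v_parts = version.split(".")
    p_parts = pattern.split(".")
    if len(v_parts) != len(p_parts):
        return False
    return all(p == "*" or p == v for v, p in zip(v_parts, p_parts))


def pick_version(available, pattern):
    """Pick the highest matching version (assumed policy), or raise ValueError."""
    candidates = [v for v in available if version_matches(v, pattern)]
    if not candidates:
        raise ValueError(f"No version matches {pattern!r}")
    # Compare components numerically so that '1.2.10' ranks above '1.2.3'.
    return max(candidates, key=lambda v: tuple(int(x) for x in v.split(".")))


chosen = pick_version(["1.0.0", "1.2.3", "1.2.10"], "1.2.*")
print(chosen)
```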

Properties

builder_config

tfds.core.BuilderConfig for this builder.

data_dir

info

tfds.core.DatasetInfo for this builder.

version

Methods

as_dataset

as_dataset(
    split=None,
    batch_size=None,
    shuffle_files=None,
    decoders=None,
    as_supervised=False,
    in_memory=None
)

Constructs a tf.data.Dataset.

Callers must pass arguments as keyword arguments.

The output types vary depending on the parameters. Examples:

import tensorflow as tf
import tensorflow_datasets as tfds

builder = tfds.builder('imdb_reviews:1.*.*')
builder.download_and_prepare()

# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}

# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

Args:

  • split: tfds.core.SplitBase, which subset(s) of the data to read. If None (default), returns all splits in a dict <key: tfds.Split, value: tf.data.Dataset>.
  • batch_size: int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users that want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
  • shuffle_files: bool, whether to shuffle the input files. Defaults to True if split == tfds.Split.TRAIN and False otherwise.
  • decoders: Nested dict of Decoder objects that customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
  • as_supervised: bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.
  • in_memory: bool, if True, loads the dataset in memory which increases iteration speeds. Note that if True and the dataset has unknown dimensions, the features will be padded to the maximum size across the dataset.
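
Two of the behaviours described above, the (input, label) structure produced by as_supervised and the 0-padding of variable-length features when batch_size is set, can be sketched in plain Python. This illustrates the semantics only and is not how tfds implements them: to_supervised and pad_batch are hypothetical helpers, and the ('text', 'label') key pair is assumed from the imdb_reviews example output shown earlier.

```python
def to_supervised(example: dict, supervised_keys: tuple) -> tuple:
    """Turn a feature dictionary into an (input, label) 2-tuple."""
    input_key, label_key = supervised_keys
    return example[input_key], example[label_key]


def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with `pad_value` to the longest length in the batch."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]


example = {"text": "If you love Japanese ..", "label": 1}
pair = to_supervised(example, ("text", "label"))
batch = pad_batch([[3, 1, 4], [1, 5], [9]])
print(pair)
print(batch)
```

If padding with zeros is not appropriate for a feature, the text above suggests using batch_size=None and building a custom pipeline with the tf.data API instead.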

Returns:

tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>.

If batch_size is -1, will return feature dictionaries containing the entire dataset in tf.Tensors instead of a tf.data.Dataset.

download_and_prepare

download_and_prepare(
    download_dir=None,
    download_config=None
)

Downloads and prepares dataset for reading.

Args:

  • download_dir: str, directory where downloaded files are stored. Defaults to "~/tensorflow-datasets/downloads".
  • download_config: tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.

Raises:

  • IOError: if there is not enough disk space available.
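
Because download_and_prepare raises IOError when disk space runs out, callers may want to check free space themselves beforehand. The following is a minimal sketch using only the Python standard library. has_enough_disk_space is a hypothetical helper, the required size must come from the caller, and this is not the check that tfds performs internally.

```python
import shutil


def has_enough_disk_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has enough free bytes."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


# Check the current directory against a deliberately impossible requirement.
ok_small = has_enough_disk_space(".", 0)
ok_huge = has_enough_disk_space(".", 10**30)
print(ok_small, ok_huge)
```

A caller could run such a check on the intended data directory and skip or postpone download_and_prepare when it returns False, while still catching IOError around the call itself.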

Class Members

  • BUILDER_CONFIGS
  • SUPPORTED_VERSIONS
  • VERSION = None
  • builder_configs
  • name = 'generator_based_builder'