Base class for datasets with data generation based on file adapter.

Inherits From: DatasetBuilder

GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder.

It expects subclasses to override _split_generators to return a dict mapping split names to generators. See the method docstrings for details.

file_format EXPERIMENTAL, may change at any time; format of the record files that the dataset is read from and written to. Defaults to tfrecord.
**kwargs Arguments passed to DatasetBuilder.

builder_config tfds.core.BuilderConfig for this builder.



info tfds.core.DatasetInfo for this builder.



versions Versions (canonical + available), in preference order.



as_dataset

Constructs a tf.data.Dataset.

Callers must pass arguments as keyword arguments.

The output types vary depending on the parameters. Examples:

builder = tfds.builder('imdb_reviews')

# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}

# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

split Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]',...). See our split API guide. If None, will return all splits in a Dict[Split, tf.data.Dataset].
batch_size int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users that want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
shuffle_files bool, whether to shuffle the input files. Defaults to False.
decoders Nested dict of Decoder objects that customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
read_config tfds.ReadConfig, additional options to configure the input pipeline (e.g. seed, num parallel reads,...).
as_supervised bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.
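
The zero-padding that batch_size applies to variable-length features can be pictured with a small pure-Python sketch. The helper name pad_batch is hypothetical and not part of TFDS; it only illustrates the padding rule for 1-D integer features and makes no claim about how the library implements it.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Pads every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max((len(seq) for seq in sequences), default=0)
    # Append pad_value until each sequence reaches max_len.
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([[1, 2, 3], [4], [5, 6]])
print(batch)
```

If padding with zeros is not acceptable (for example when 0 is a meaningful value), the text above suggests using batch_size=None and building the batching step with the tf.data API instead.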

tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>.

If batch_size is -1, will return feature dictionaries containing the entire dataset in tf.Tensors instead of a tf.data.Dataset.


download_and_prepare

Downloads and prepares dataset for reading.

download_dir str, directory where downloaded files are stored. Defaults to "~/tensorflow-datasets/downloads".
download_config tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.

IOError if there is not enough disk space available.


_generate_examples

Default function to generate examples for each split.

The function should return a collection of (key, example) pairs. Examples will be encoded and written to disk. See the Yields section for details.

The function can return/yield:

  • A python generator:
def _generate_examples(self, path):
  for filepath in path.iterdir():
    yield filepath.name, {'image': ..., 'label': ...}
  • A beam.PTransform of (input_types: [] -> output_types: KeyExample): For big datasets and distributed generation. See our Apache Beam datasets guide for more info.
def _generate_examples(self, path):
  return (
      beam.Create(path.iterdir())
      | beam.Map(lambda filepath: (filepath.name, {'image': ..., ...}))
  )
  • A beam.PCollection: This should only be used if you need to share some distributed processing across splits. In this case, you can use the following pattern:
def _split_generators(self, dl_manager, pipeline):
  # Distributed processing shared across splits
  pipeline |= beam.Create(path.iterdir())
  pipeline |= 'SharedPreprocessing' >> beam.Map(_common_processing)
  # Wrap the pipeline inside a ptransform_fn to add `'label' >> ` and avoid
  # duplicated PTransform nodes names.
  generate_examples = beam.ptransform_fn(self._generate_examples)
  return {
      'train': pipeline | 'train' >> generate_examples(is_train=True),
      'test': pipeline | 'test' >> generate_examples(is_train=False),
  }

def _generate_examples(self, pipeline, is_train: bool):
  return pipeline | beam.Map(_split_specific_processing, is_train=is_train)
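
As a rough illustration of the plain Python generator form, the sketch below yields one (key, example) pair per file in a directory. It is written as a free function so it runs without TFDS. The file naming scheme (`<label>_<id>.txt`) and the use of the file text as the 'image' feature are assumptions made purely for the example.

```python
import pathlib
import tempfile
from typing import Dict, Iterator, Tuple


def generate_examples(path: pathlib.Path) -> Iterator[Tuple[str, Dict[str, object]]]:
    """Yields (key, example) pairs, one per file in `path`.

    Assumption for illustration only: each file is named `<label>_<id>.txt`
    and its text content stands in for the 'image' feature.
    """
    # Sort so the iteration order does not depend on the filesystem.
    for filepath in sorted(path.iterdir()):
        label = filepath.stem.split('_')[0]
        # The file name is unique within the directory, so it works as a key.
        yield filepath.name, {'image': filepath.read_text(), 'label': label}


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / 'cat_0.txt').write_text('pixels-a')
    (root / 'dog_1.txt').write_text('pixels-b')
    examples = list(generate_examples(root))

print(examples)
```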

**kwargs Arguments from the _split_generators

key str or int, a unique deterministic example identification key.

  • Unique: An error will be raised if two examples are yielded with the same key.
  • Deterministic: When generating the dataset twice, the same example should have the same key. Good keys can be the image id, or line number if examples are extracted from a text file. The key will be hashed and sorted to shuffle examples deterministically, so that generating the dataset multiple times keeps examples in the same order.
example dict<str feature_name, feature_value>, a feature dictionary ready to be encoded and written to disk. The example will be encoded with self.info.features.encode_example({...}).
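
The two key requirements can be made concrete with a small sketch: duplicate keys are rejected, and examples are ordered by a hash of their key so the result does not depend on the order in which they were produced. The function shuffle_by_key is hypothetical, and SHA-256 is only a stand-in, since the hash TFDS actually uses is not specified here.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple, Union

Key = Union[str, int]


def shuffle_by_key(pairs: Iterable[Tuple[Key, Dict]]) -> List[Tuple[Key, Dict]]:
    """Orders examples by a hash of their key, rejecting duplicate keys."""
    seen = set()
    hashed = []
    for key, example in pairs:
        if key in seen:
            raise ValueError(f'Duplicate key: {key!r}')
        seen.add(key)
        # SHA-256 is a stand-in; the hash TFDS uses is not specified here.
        digest = hashlib.sha256(str(key).encode('utf-8')).hexdigest()
        hashed.append((digest, key, example))
    # Sorting by digest gives a pseudo-random but reproducible order.
    hashed.sort(key=lambda item: item[0])
    return [(key, example) for _, key, example in hashed]


pairs = [(i, {'value': i}) for i in range(5)]
first = shuffle_by_key(pairs)
second = shuffle_by_key(reversed(pairs))
print(first == second)
```

Because the order depends only on the keys, a key that changes between runs (for example a random UUID) would defeat the reproducibility this scheme is meant to provide.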


_info

Returns the tfds.core.DatasetInfo object.

This function is called once and the result is cached for all following calls.

dataset_info The dataset metadata.
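
The "called once, then cached" behavior can be sketched with functools.cached_property from the standard library. The class InfoHolder and its attribute names are hypothetical stand-ins, and this is not a claim about how TFDS implements the caching.

```python
import functools


class InfoHolder:
    """Hypothetical stand-in for a builder; not the TFDS class."""

    def __init__(self):
        self.info_calls = 0

    def _info(self) -> dict:
        # Counts invocations so the caching is observable.
        self.info_calls += 1
        return {'description': 'toy dataset'}

    @functools.cached_property
    def info(self) -> dict:
        # The first access calls _info(); later accesses reuse the stored result.
        return self._info()


holder = InfoHolder()
first = holder.info
second = holder.info
print(holder.info_calls)
```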


_split_generators

Downloads the data and returns dataset splits with associated examples.


def _split_generators(self, dl_manager):
  path = dl_manager.download_and_extract('http://dataset.org/my_data.zip')
  return {
      'train': self._generate_examples(path=path / 'train_imgs'),
      'test': self._generate_examples(path=path / 'test_imgs'),
  }
  • If the original dataset does not have predefined train, test,... splits, this function should return only a single train split. Users can use the subsplit API to create subsplits (e.g. tfds.load(..., split=['train[:75%]', 'train[75%:]'])).
  • tfds.download.DownloadManager caches downloads, so calling download on the same url multiple times downloads it only once.
  • A good practice is to download all data in this function, and have all the computation inside _generate_examples.
  • Splits are generated in the order defined here. builder.info.splits keeps the same order.
  • This function can have an extra pipeline kwarg only if some beam preprocessing should be shared across splits. In this case, a dict of beam.PCollection should be returned. See _generate_examples for details.

dl_manager tfds.download.DownloadManager used to download/extract the data

A dict mapping split names to generators. See _generate_examples for details about the generator format.
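
To show how the two methods fit together, here is a minimal sketch that mimics the contract without depending on TFDS. ToyBuilder and materialize are hypothetical names, and an in-memory dict replaces the download manager, so this only illustrates the shape of the returned dict and the order in which splits are consumed.

```python
from typing import Dict, Iterator, List, Tuple

Example = Tuple[int, Dict[str, str]]


class ToyBuilder:
    """Hypothetical stand-in that mimics the two-method contract; not TFDS."""

    def _split_generators(self, data: Dict[str, List[str]]) -> Dict[str, Iterator[Example]]:
        # Splits are declared in the order they should be generated.
        return {
            'train': self._generate_examples(data['train']),
            'test': self._generate_examples(data['test']),
        }

    def _generate_examples(self, lines: List[str]) -> Iterator[Example]:
        # The line number serves as a unique, deterministic key.
        for index, line in enumerate(lines):
            yield index, {'text': line}


def materialize(builder: ToyBuilder, data: Dict[str, List[str]]) -> Dict[str, List[Example]]:
    """Consumes each split's generator in declaration order (toy driver)."""
    return {name: list(gen) for name, gen in builder._split_generators(data).items()}


splits = materialize(ToyBuilder(), {'train': ['a', 'b'], 'test': ['c']})
print(list(splits))
```

Since Python dicts preserve insertion order, the order in which the splits appear in the returned dict is the order the toy driver processes them in.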







code_path Instance of tensorflow_datasets.core.utils.gpath.PosixGPath
name 'generator_based_builder'
url_infos None