Base class for datasets with data generation based on file adapter.
Inherits From: DatasetBuilder
tfds.core.GeneratorBasedBuilder(
    *,
    file_format: Union[None, str, file_adapters.FileFormat] = file_adapters.DEFAULT_FILE_FORMAT,
    **kwargs
)
GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to overwrite _split_generators to return a dict mapping split names to generators. See the method docstrings for details.
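For orientation, here is a minimal sketch of such a subclass; the dataset name, features, download URL and file layout are illustrative assumptions, not part of the API:
```python
import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
  """Illustrative builder; adapt features, URL and file layout to your data."""

  VERSION = tfds.core.Version('1.0.0')

  def _info(self) -> tfds.core.DatasetInfo:
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Image(),
            'label': tfds.features.ClassLabel(names=['no', 'yes']),
        }),
    )

  def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    # Hypothetical source archive; replace with your dataset's URL.
    path = dl_manager.download_and_extract('https://example.org/my_data.zip')
    return {
        'train': self._generate_examples(path / 'train'),
        'test': self._generate_examples(path / 'test'),
    }

  def _generate_examples(self, path):
    # Yields (key, example) pairs; the key must be unique and deterministic.
    for img_path in path.glob('*.jpeg'):
      yield img_path.name, {'image': img_path, 'label': 'yes'}
```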
Args | |
---|---|
file_format | EXPERIMENTAL, may change at any time; Format of the record files in which the dataset will be read/written. Defaults to tfrecord.
**kwargs | Arguments passed to DatasetBuilder.
Attributes | |
---|---|
builder_config | tfds.core.BuilderConfig for this builder.
canonical_version | 
data_dir | 
data_path | 
info | tfds.core.DatasetInfo for this builder.
release_notes | 
supported_versions | 
version | 
versions | Versions (canonical + available), in preference order.
Methods
as_dataset
as_dataset(
    split=None, *, batch_size=None, shuffle_files=False, decoders=None,
    read_config=None, as_supervised=False
)
Constructs a tf.data.Dataset.
Callers must pass arguments as keyword arguments.
The output types vary depending on the parameters. Examples:
builder = tfds.builder('imdb_reviews')
builder.download_and_prepare()
# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys()) # ==> ['test', 'train', 'unsupervised']
assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
# 'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
# 'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}
# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys()) # ==> ['test', 'train', 'unsupervised']
assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
Args | |
---|---|
split | Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]', ...). See our split API guide. If None, will return all splits in a Dict[Split, tf.data.Dataset].
batch_size | int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users that want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
shuffle_files | bool, whether to shuffle the input files. Defaults to False.
decoders | Nested dict of Decoder objects which allow customizing the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
read_config | tfds.ReadConfig, additional options to configure the input pipeline (e.g. seed, num parallel reads, ...).
as_supervised | bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.
Returns | |
---|---|
tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
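For example, a hedged sketch of the decoders argument (it assumes a dataset with an 'image' feature; the builder name is illustrative):
```python
builder = tfds.builder('mnist')  # illustrative; any dataset with an 'image' feature
builder.download_and_prepare()
# Skip image decoding so the raw encoded bytes are returned.
ds = builder.as_dataset(
    split='train',
    decoders={'image': tfds.decode.SkipDecoding()},
)
```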
download_and_prepare
download_and_prepare(
    *, download_dir=None, download_config=None
)
Downloads and prepares dataset for reading.
Args | |
---|---|
download_dir | str, directory where downloaded files are stored. Defaults to "~/tensorflow-datasets/downloads".
download_config | tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.
Raises | |
---|---|
IOError | if there is not enough disk space available.
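A small usage sketch (the dataset name and download directory are illustrative):
```python
builder = tfds.builder('mnist')
builder.download_and_prepare(
    download_dir='/tmp/tfds_downloads',
    download_config=tfds.download.DownloadConfig(),
)
ds = builder.as_dataset(split='train')
```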
_generate_examples
@abc.abstractmethod
_generate_examples( **kwargs ) -> split_builder_lib.SplitGenerator
Default function to generate examples for each split.
The function should return a collection of (key, example) pairs. Examples will be encoded and written to disk. See the yields section for details.
The function can return/yield:
- A Python generator:
def _generate_examples(self, path):
  for filepath in path.iterdir():
    yield filepath.name, {'image': ..., 'label': ...}
- A beam.PTransform of (input_types: [] -> output_types: KeyExample): for big datasets and distributed generation. See our Apache Beam datasets guide for more info.
def _generate_examples(self, path):
  return (
      beam.Create(path.iterdir())
      | beam.Map(lambda filepath: (filepath.name, {'image': ..., 'label': ...}))
  )
- A beam.PCollection: this should only be used if you need to share some distributed processing across splits. In this case, you can use the following pattern:
def _split_generators(self, dl_manager, pipeline):
  ...
  # Distributed processing shared across splits
  pipeline |= beam.Create(path.iterdir())
  pipeline |= 'SharedPreprocessing' >> beam.Map(_common_processing)
  ...
  # Wrap the pipeline inside a ptransform_fn so each split can prefix it with a
  # `'label' >> ` name and avoid duplicated PTransform node names.
  generate_examples = beam.ptransform_fn(self._generate_examples)
  return {
      'train': pipeline | 'train' >> generate_examples(is_train=True),
      'test': pipeline | 'test' >> generate_examples(is_train=False),
  }

def _generate_examples(self, pipeline, is_train: bool):
  return pipeline | beam.Map(_split_specific_processing, is_train=is_train)
Args | |
---|---|
**kwargs | Keyword arguments passed from _split_generators.

Yields | |
---|---|
key | str or int, a unique deterministic example identification key.
example | dict<str feature_name, feature_value>, a feature dictionary ready to be encoded and written to disk. The example will be encoded with self.info.features.encode_example({...}).
_info
@abc.abstractmethod
_info()
Returns the tfds.core.DatasetInfo object.
This function is called once and the result is cached for all following calls.
Returns | |
---|---|
dataset_info | The dataset metadata.
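A minimal sketch of an _info implementation; the description, features, supervised keys and homepage are illustrative:
```python
def _info(self) -> tfds.core.DatasetInfo:
  # Illustrative metadata; adapt to your dataset.
  return tfds.core.DatasetInfo(
      builder=self,
      description='Description of my dataset.',
      features=tfds.features.FeaturesDict({
          'image': tfds.features.Image(shape=(None, None, 3)),
          'label': tfds.features.ClassLabel(names=['no', 'yes']),
      }),
      supervised_keys=('image', 'label'),
      homepage='https://example.org/dataset',
  )
```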
_split_generators
@abc.abstractmethod
_split_generators(
    dl_manager: tfds.download.DownloadManager
) -> Dict[splits_lib.Split, split_builder_lib.SplitGenerator]
Downloads the data and returns dataset splits with associated examples.
Example:
def _split_generators(self, dl_manager):
  path = dl_manager.download_and_extract('http://dataset.org/my_data.zip')
  return {
      'train': self._generate_examples(path=path / 'train_imgs'),
      'test': self._generate_examples(path=path / 'test_imgs'),
  }
- If the original dataset does not have predefined train, test, ... splits, this function should only return a single train split here. Users can use the subsplit API to create subsplits (e.g. tfds.load(..., split=['train[:75%]', 'train[75%:]'])).
- tfds.download.DownloadManager caches downloads, so calling download on the same url multiple times only downloads it once.
- A good practice is to download all data in this function, and to have all the computation inside _generate_examples.
- Splits are generated in the order defined here. builder.info.splits keeps the same order.
- This function can have an extra pipeline kwarg only if some beam preprocessing should be shared across splits. In this case, a dict of beam.PCollection should be returned. See _generate_examples for details.
Args | |
---|---|
dl_manager | tfds.download.DownloadManager used to download/extract the data.
Returns | |
---|---|
The dict mapping split names to generators. See _generate_examples for details about the generator format.
Class Variables | |
---|---|
BUILDER_CONFIGS | 
MANUAL_DOWNLOAD_INSTRUCTIONS | None
RELEASE_NOTES | 
SUPPORTED_VERSIONS | 
VERSION | None
builder_configs | 
code_path | 
name | 'generator_based_builder'
url_infos | None
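For illustration, a hedged sketch of how a subclass might set these class variables (all values are made up; the abstract methods are omitted):
```python
class MyDataset(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('1.1.0')
  SUPPORTED_VERSIONS = [tfds.core.Version('1.0.0')]
  RELEASE_NOTES = {
      '1.1.0': 'Added the test split.',
      '1.0.0': 'Initial release.',
  }
  MANUAL_DOWNLOAD_INSTRUCTIONS = 'Place the archive in the manual_dir.'
  BUILDER_CONFIGS = [
      tfds.core.BuilderConfig(name='default', description='Default config.'),
  ]
  # _info, _split_generators and _generate_examples omitted for brevity.
```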