tfds.dataset_builders.TfDataBuilder

DatasetBuilder that builds a TFDS dataset from a tf.data.Dataset.

Inherits From: GeneratorBasedBuilder, DatasetBuilder

This class can be used to create a new dataset builder class, but also in an ad hoc manner, e.g. from a notebook.

If you are in a notebook and you want to transform a tf.data.Dataset into a TFDS dataset, then you can do so as follows:

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

my_ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
my_ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})

# Optionally define a custom `data_dir`. If None, then the default data dir is
# used.
custom_data_dir = "/my/folder"

# Define the builder.
builder = tfds.dataset_builders.TfDataBuilder(
    name="my_dataset",
    config="single_number",
    version="1.0.0",
    data_dir=custom_data_dir,
    split_datasets={
        "train": my_ds_train,
        "test": my_ds_test,
    },
    features=tfds.features.FeaturesDict({
        "number": tfds.features.Scalar(dtype=tf.int64),
    }),
    description="My dataset with a single number.",
    release_notes={
        "1.0.0": "Initial release with numbers up to 5!",
    }
)

# Make the builder store the data as a TFDS dataset.
builder.download_and_prepare()

The config argument is optional and can be useful if you want to store different configs under the same dataset.

The data_dir argument can be used to store the generated TFDS dataset in a different folder, for example in your own sandbox if you don't want to share this with others (yet). Note that when doing this, you also need to pass the data_dir to tfds.load. If the data_dir argument is not specified, then the default TFDS data dir will be used.
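
As a rough illustration of where the prepared files end up, the sketch below builds the expected folder for a dataset. It assumes the usual TFDS on-disk layout of `<data_dir>/<name>/<config>/<version>`; the helper name `expected_dataset_dir` is hypothetical and not part of the TFDS API.

```python
from pathlib import PurePosixPath
from typing import Optional


def expected_dataset_dir(
    data_dir: str, name: str, version: str, config: Optional[str] = None
) -> PurePosixPath:
    """Hypothetical helper: folder where a prepared dataset is expected.

    Assumes the layout <data_dir>/<name>/<config>/<version>, with the
    config level omitted when no config is used.
    """
    parts = [name]
    if config:
        parts.append(config)
    parts.append(version)
    return PurePosixPath(data_dir).joinpath(*parts)


print(expected_dataset_dir(
    "/my/folder", "my_dataset", "1.0.0", config="single_number"))
```

The same `data_dir` value has to be given to both the builder and `tfds.load`, because the location is derived from it on both sides.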

After the TFDS dataset has been stored, it can be loaded from other scripts:

# If no custom data dir was specified:
ds_test = tfds.load("my_dataset/single_number", split="test")

# When there are multiple versions, you can also specify the version.
ds_test = tfds.load("my_dataset/single_number:1.0.0", split="test")

# If the TFDS dataset was stored in a custom folder, it can be loaded as follows:
custom_data_dir = "/my/folder"
ds_test = tfds.load("my_dataset/single_number:1.0.0", split="test",
                    data_dir=custom_data_dir)
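
The strings passed to `tfds.load` above follow the pattern `name/config:version`, where the config and the version are both optional. The sketch below splits such a string into its parts. It is only an illustration of the naming scheme shown here, not the parser TFDS uses, and `DatasetRef` and `split_dataset_string` are hypothetical names.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    """Hypothetical container for the parts of a dataset string."""
    name: str
    config: Optional[str]
    version: Optional[str]


def split_dataset_string(spec: str) -> DatasetRef:
    """Splits 'name[/config][:version]' into its three parts.

    Only handles the simple forms shown in this page.
    """
    # The version, if present, follows the first ':'.
    body, _, version = spec.partition(":")
    # The config, if present, follows the first '/'.
    name, _, config = body.partition("/")
    return DatasetRef(name=name, config=config or None, version=version or None)


print(split_dataset_string("my_dataset/single_number:1.0.0"))
print(split_dataset_string("my_dataset/single_number"))
```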

You can also define a new DatasetBuilder based on this class.

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

class MyDatasetBuilder(tfds.dataset_builders.TfDataBuilder):
  def __init__(self):
    # Each example must be a dict whose keys match the declared features.
    ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
    ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})
    super().__init__(
      name="my_dataset",
      version="1.0.0",
      split_datasets={
          "train": ds_train,
          "test": ds_test,
      },
      features=tfds.features.FeaturesDict({
          "number": tfds.features.Scalar(dtype=tf.int64),
      }),
      config="single_number",
      description="My dataset with a single number.",
      release_notes={
          "1.0.0": "Initial release with numbers up to 5!",
      }
    )

file_format EXPERIMENTAL, may change at any time. Format of the record files that the dataset is written to and read from. If None, defaults to tfrecord.
**kwargs Arguments passed to DatasetBuilder.

builder_config tfds.core.BuilderConfig for this builder.
canonical_version

data_dir

data_path

info tfds.core.DatasetInfo for this builder.
release_notes

supported_versions

version

versions Versions (canonical + available), in preference order.

Methods

as_dataset

Constructs a tf.data.Dataset.

Callers must pass arguments as keyword arguments.

The output types vary depending on the parameters. Examples:

builder = tfds.builder('imdb_reviews')
builder.download_and_prepare()

# Default parameters: returns a dict of tf.data.Dataset, one per split
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}

# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

Args
split Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]',...). See our split API guide. If None, will return all splits in a Dict[Split, tf.data.Dataset].
batch_size int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users that want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
shuffle_files bool, whether to shuffle the input files. Defaults to False.
decoders Nested dict of Decoder objects that customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
read_config tfds.ReadConfig, Additional options to configure the input pipeline (e.g. seed, num parallel reads,...).
as_supervised bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.
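
The 0-padding mentioned for batch_size can be pictured with the plain-Python sketch below: every variable-length feature in a batch is extended with zeros to the length of the longest one. This is only an illustration of the idea, not TFDS or tf.data code, and `pad_batch` is a hypothetical helper.

```python
from typing import List, Sequence


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_value: int = 0
) -> List[List[int]]:
    """Pads every sequence with pad_value up to the longest length."""
    max_len = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]


print(pad_batch([[1, 2, 3], [4], [5, 6]]))
# [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```

If zeros are a meaningful value in the data, padding like this is ambiguous, which is one reason to use batch_size=None and build a custom pipeline with the tf.data API instead.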

Returns
tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>.

If batch_size is -1, will return feature dictionaries containing the entire dataset in tf.Tensors instead of a tf.data.Dataset.
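
The effect of as_supervised on the structure of each returned element can be sketched in plain Python: a feature dictionary is reduced to an (input, label) pair selected by the two supervised keys. This is a simplified illustration under the assumption that supervised_keys is a 2-tuple of feature names, as described above; `to_supervised` is a hypothetical helper, not a TFDS function.

```python
from typing import Any, Dict, Tuple


def to_supervised(
    example: Dict[str, Any], supervised_keys: Tuple[str, str]
) -> Tuple[Any, Any]:
    """Picks the (input, label) pair out of a feature dictionary."""
    input_key, label_key = supervised_keys
    return example[input_key], example[label_key]


# One element as it would look with as_supervised=False.
example = {"label": 1, "text": b"I've watched the movie .."}

# The same element as it would look with as_supervised=True.
text, label = to_supervised(example, ("text", "label"))
print(text, label)
```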

download_and_prepare

Downloads and prepares dataset for reading.

Args
download_dir str, directory where downloaded files are stored. Defaults to "~/tensorflow_datasets/downloads".
download_config tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.
file_format optional str or file_adapters.FileFormat, format of the record files in which the dataset will be written.

Raises
IOError if there is not enough disk space available.
RuntimeError when the config cannot be found.

get_default_builder_config

Returns the default builder config if there is one.

Note that for dataset builders that cannot use the cls.BUILDER_CONFIGS, we need a method that uses the instance to get BUILDER_CONFIGS and DEFAULT_BUILDER_CONFIG_NAME.

Returns
the default builder config if there is one

BUILDER_CONFIGS []
DEFAULT_BUILDER_CONFIG_NAME None
MANUAL_DOWNLOAD_INSTRUCTIONS None
RELEASE_NOTES {}

SUPPORTED_VERSIONS []
VERSION None
builder_config_cls None
builder_configs {}

code_path Instance of etils.epath.gpath.PosixGPath
default_builder_config None
name 'tf_data_builder'
url_infos None