tf.data.Dataset

Represents a potentially large set of elements.

tf.data.Dataset(
    variant_tensor
)

Used in the notebooks

Used in the guide:
tf.data: Build TensorFlow input pipelines
Better performance with the tf.data API
Extension types
Migrate from Estimator to Keras APIs
Distributed training with TensorFlow

Used in the tutorials:
Distributed Input
Parameter server training with ParameterServerStrategy
Load CSV data
Custom training with tf.distribute.Strategy
pix2pix: Image-to-image translation with a conditional GAN

The tf.data.Dataset API supports writing descriptive and efficient input pipelines. Dataset usage follows a common pattern:

Create a source dataset from your input data.
Apply dataset transformations to preprocess the data.
Iterate over the dataset and process the elements.

Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
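
The three steps above can be sketched as follows. This is a minimal illustration, assuming TensorFlow 2.x with eager execution; the particular transformations (`filter`, `map`, `batch`) are chosen only for the example.

```python
import tensorflow as tf

# 1. Create a source dataset from in-memory data.
dataset = tf.data.Dataset.range(10)

# 2. Apply transformations: keep even values, square them, then batch.
dataset = dataset.filter(lambda x: x % 2 == 0)
dataset = dataset.map(lambda x: x * x)
dataset = dataset.batch(2)

# 3. Iterate; elements are produced one batch at a time.
batches = [batch.numpy().tolist() for batch in dataset]
print(batches)  # [[0, 4], [16, 36], [64]]
```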

Source Datasets:

The simplest way to create a dataset is to create it from a Python list:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)
# Output:
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)

To process lines from files, use tf.data.TextLineDataset:

dataset = tf.data.TextLineDataset(["file1.txt", "file2.txt"])

To process records written in the TFRecord format, use tf.data.TFRecordDataset:

dataset = tf.data.TFRecordDataset(["file1.tfrecords", "file2.tfrecords"])

To create a dataset of all files matching a pattern, use tf.data.Dataset.list_files:

dataset = tf.data.Dataset.list_files("/path/*.txt")

See tf.data.FixedLengthRecordDataset and tf.data.Dataset.from_generator for more ways to create datasets.

Transformations:

Once you have a dataset, you can apply transformations to prepare the data for your model:

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.map(lambda x: x*2)  # double each element
list(dataset.as_numpy_iterator())
# [2, 4, 6]

Common Terms:

Element: A single output from calling next() on a dataset iterator. Elements may be nested structures containing multiple components. For example, the element (1, (3, "apple")) has one tuple nested in another tuple. The components are 1, 3, and "apple".

Component: The leaf in the nested structure of an element.

Supported types:

Elements can be nested structures of tuples, named tuples, and dictionaries. Note that Python lists are not treated as nested structures of components. Instead, lists are converted to tensors and treated as components. For example, the element (1, [1, 2, 3]) has only two components: the tensor 1 and the tensor [1, 2, 3]. Element components can be of any type representable by tf.TypeSpec, including tf.Tensor, tf.data.Dataset, tf.sparse.SparseTensor, tf.RaggedTensor, and tf.TensorArray.

import collections
import tensorflow as tf

a = 1 # Integer element
b = 2.0 # Float element
c = (1, 2) # Tuple element with 2 components
d = {"a": (2, 2), "b": 3} # Dict element with 3 components
Point = collections.namedtuple("Point", ["x", "y"])
e = Point(1, 2) # Named tuple element with 2 components
f = tf.data.Dataset.range(10) # Dataset element

For more information, read the dataset structure section of the tf.data guide (https://www.tensorflow.org/guide/data#dataset_structure).

Args
`variant_tensor`	A DT_VARIANT tensor that represents the dataset.

Attributes

element_spec

The type specification of an element of this dataset.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset.element_spec
TensorSpec(shape=(), dtype=tf.int32, name=None)

For more information, read the dataset structure section of the tf.data guide (https://www.tensorflow.org/guide/data#dataset_structure).
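
For a nested element, `element_spec` mirrors the element's structure. The sketch below is illustrative only: the dictionary keys `id` and `xy` are made-up names, and `tf.nest.flatten` is used here just to count the components (leaves).

```python
import tensorflow as tf

# A dict element with two components: a scalar "id" and a length-2 vector "xy".
dataset = tf.data.Dataset.from_tensor_slices(
    {"id": [1, 2, 3], "xy": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]})

spec = dataset.element_spec
print(spec)

# tf.nest.flatten returns the leaves (components) of the structure.
components = tf.nest.flatten(spec)
print(len(components))  # 2
```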

as_numpy_iterator

Raises
`TypeError`	if an element contains a non-`Tensor` value.
`RuntimeError`	if eager execution is not enabled.

batch

Args
`batch_size`	A `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements of this dataset to combine in a single batch.
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last batch should be dropped in the case it has fewer than `batch_size` elements; the default behavior is not to drop the smaller batch.
`num_parallel_calls`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the number of batches to compute asynchronously in parallel. If not specified, batches will be computed sequentially. If the value `tf.data.AUTOTUNE` is used, then the number of parallel calls is set dynamically based on available resources.
`deterministic`	(Optional.) When `num_parallel_calls` is specified, if this boolean is specified (`True` or `False`), it controls the order in which the transformation produces elements. If set to `False`, the transformation is allowed to yield elements out of order to trade determinism for performance. If not specified, the `tf.data.Options.deterministic` option (`True` by default) controls the behavior.
`name`	(Optional.) A name for the tf.data operation.
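
A minimal sketch of how the `batch` arguments described above interact, assuming TensorFlow 2.x in eager mode. The dataset of eight integers is chosen so that the last batch is short.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(8)

# Default: the final batch is kept even though it has only 2 elements.
kept = [b.tolist() for b in dataset.batch(3).as_numpy_iterator()]

# drop_remainder=True discards the short batch, so every batch has size 3.
dropped = [
    b.tolist()
    for b in dataset.batch(3, drop_remainder=True).as_numpy_iterator()
]

# Batches may be assembled in parallel; deterministic=True preserves order.
parallel = [
    b.tolist()
    for b in dataset.batch(
        3, num_parallel_calls=tf.data.AUTOTUNE, deterministic=True
    ).as_numpy_iterator()
]

print(kept)     # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(dropped)  # [[0, 1, 2], [3, 4, 5]]
```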

bucket_by_sequence_length

Args
`element_length_func`	A function mapping an element of the `Dataset` to a `tf.int32` scalar giving the element's length, which determines the bucket the element goes into.
`bucket_boundaries`	A list of ints giving the upper length boundaries of the buckets.
`bucket_batch_sizes`	A list of ints giving the batch size for each bucket. Its length should be `len(bucket_boundaries) + 1`.
`padded_shapes`	A nested structure of `tf.TensorShape` to pass to `tf.data.Dataset.padded_batch`. If not provided, `dataset.output_shapes` is used, which results in variable-length dimensions being padded out to the maximum length in each batch.
`padding_values`	Values to pad with, passed to `tf.data.Dataset.padded_batch`. Defaults to padding with 0.
`pad_to_bucket_boundary`	A `bool`. If `False`, dimensions with unknown size are padded to the maximum length in the batch. If `True`, they are padded to the bucket boundary minus 1 (i.e., the maximum length in each bucket), and the caller must ensure that the source `Dataset` does not contain any elements with length longer than `max(bucket_boundaries)`.
`no_padding`	A `bool` indicating whether to pad the batch features (features need to be either of type `tf.sparse.SparseTensor` or of the same shape).
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last batch should be dropped in the case it has fewer than `batch_size` elements; the default behavior is not to drop the smaller batch.
`name`	(Optional.) A name for the tf.data operation.
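
The bucketing arguments above are easier to follow with concrete lengths. The sketch below assumes a TensorFlow 2.x release in which `bucket_by_sequence_length` is available as a `Dataset` method; the sequences are arbitrary sample data. With boundaries `[3, 5]` there are three buckets: length below 3, length 3 or 4, and length 5 or more.

```python
import tensorflow as tf

elements = [
    [0], [1, 2, 3, 4], [5, 6, 7],
    [7, 8, 9, 10, 11], [13, 14, 15, 16, 19, 20], [21, 22],
]
dataset = tf.data.Dataset.from_generator(
    lambda: elements,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int64),
)

dataset = dataset.bucket_by_sequence_length(
    element_length_func=lambda elem: tf.shape(elem)[0],
    bucket_boundaries=[3, 5],      # two boundaries give three buckets
    bucket_batch_sizes=[2, 2, 2],  # one batch size per bucket
)

# Each batch is padded with 0 up to the longest sequence in that batch.
batches = [b.tolist() for b in dataset.as_numpy_iterator()]
for b in batches:
    print(b)
```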

cache

Args
`filename`	A `tf.string` scalar `tf.Tensor`, representing the name of a directory on the filesystem to use for caching elements in this Dataset. If a filename is not provided, the dataset will be cached in memory.
`name`	(Optional.) A name for the tf.data operation.
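
As a small sketch of in-memory caching (no `filename` given), assuming TensorFlow 2.x in eager mode: the first full pass fills the cache and later passes read from it, so both passes see the same elements.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5).map(lambda x: x ** 2)
dataset = dataset.cache()  # no filename: elements are cached in memory

# The first complete iteration populates the cache.
first_epoch = [int(x) for x in dataset.as_numpy_iterator()]
# Subsequent iterations are served from the cache.
second_epoch = [int(x) for x in dataset.as_numpy_iterator()]

print(first_epoch)  # [0, 1, 4, 9, 16]
```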

choose_from_datasets

Args
`datasets`	A non-empty list of `tf.data.Dataset` objects with compatible structure.
`choice_dataset`	A `tf.data.Dataset` of scalar `tf.int64` tensors between `0` and `len(datasets) - 1`.
`stop_on_empty_dataset`	If `True`, selection stops if it encounters an empty dataset. If `False`, it skips empty datasets. It is recommended to set it to `True`. Otherwise, the selected elements start off as the user intends, but may change as input datasets become empty. This can be difficult to detect since the dataset starts off looking correct. Defaults to `True`.

Raises
`TypeError`	If `datasets` or `choice_dataset` has the wrong type.
`ValueError`	If `datasets` is empty.
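
A minimal sketch of `choose_from_datasets`, assuming a TensorFlow 2.x release that exposes it as a static method on `tf.data.Dataset`. The values 100 and 200 are arbitrary markers that make it easy to see which input each output element came from.

```python
import tensorflow as tf

datasets = [
    tf.data.Dataset.from_tensors(100).repeat(),
    tf.data.Dataset.from_tensors(200).repeat(),
]
# Each choice is an index into `datasets`, in [0, len(datasets) - 1].
choice_dataset = tf.data.Dataset.from_tensor_slices(
    tf.constant([0, 1, 1, 0], dtype=tf.int64))

result = tf.data.Dataset.choose_from_datasets(datasets, choice_dataset)
chosen = [int(x) for x in result.as_numpy_iterator()]
print(chosen)  # [100, 200, 200, 100]
```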

concatenate

Args
`dataset`	`Dataset` to be concatenated.
`name`	(Optional.) A name for the tf.data operation.
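
For illustration, concatenating two small ranges (assuming TensorFlow 2.x in eager mode) appends the elements of the argument after those of the receiver:

```python
import tensorflow as tf

a = tf.data.Dataset.range(1, 4)  # 1, 2, 3
b = tf.data.Dataset.range(4, 8)  # 4, 5, 6, 7

combined = [int(x) for x in a.concatenate(b).as_numpy_iterator()]
print(combined)  # [1, 2, 3, 4, 5, 6, 7]
```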

counter

Args
`start`	(Optional.) The starting value for the counter. Defaults to 0.
`step`	(Optional.) The step size for the counter. Defaults to 1.
`dtype`	(Optional.) The data type for counter elements. Defaults to `tf.int64`.
`name`	(Optional.) A name for the tf.data operation.

filter

Args
`predicate`	A function mapping a dataset element to a boolean.
`name`	(Optional.) A name for the tf.data operation.
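
A short sketch of a `filter` predicate, assuming TensorFlow 2.x in eager mode. The predicate receives one dataset element and returns a scalar boolean tensor.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Keep only the multiples of 3.
multiples = dataset.filter(lambda x: x % 3 == 0)
kept = [int(x) for x in multiples.as_numpy_iterator()]
print(kept)  # [0, 3, 6, 9]
```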

flat_map

Args
`map_func`	A function mapping a dataset element to a dataset.
`name`	(Optional.) A name for the tf.data operation.
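
As a sketch of `flat_map`, assuming TensorFlow 2.x in eager mode: each row is mapped to a dataset of its entries, and the resulting datasets are flattened in order.

```python
import tensorflow as tf

rows = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# map_func turns each row into a dataset of its scalar entries.
flat = rows.flat_map(tf.data.Dataset.from_tensor_slices)
values = [int(x) for x in flat.as_numpy_iterator()]
print(values)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```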

from_generator

Args
`generator`	A callable object that returns an object that supports the `iter()` protocol. If `args` is not specified, `generator` must take no arguments; otherwise it must take as many arguments as there are values in `args`.
`output_types`	(Optional.) A (nested) structure of `tf.DType` objects corresponding to each component of an element yielded by `generator`.
`output_shapes`	(Optional.) A (nested) structure of `tf.TensorShape` objects corresponding to each component of an element yielded by `generator`.
`args`	(Optional.) A tuple of `tf.Tensor` objects that will be evaluated and passed to `generator` as NumPy-array arguments.
`output_signature`	(Optional.) A (nested) structure of `tf.TypeSpec` objects corresponding to each component of an element yielded by `generator`.
`name`	(Optional.) A name for the tf.data operations used by `from_generator`.
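
The sketch below shows `args` and `output_signature` used together, assuming TensorFlow 2.x in eager mode. The generator name `ragged_rows` and its `stop` parameter are hypothetical, introduced only for this example.

```python
import tensorflow as tf

def ragged_rows(stop):
    # `stop` arrives as a NumPy value, so convert it before calling range().
    for i in range(1, int(stop) + 1):
        yield i, [i] * i

dataset = tf.data.Dataset.from_generator(
    ragged_rows,
    args=(3,),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),       # the scalar i
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # variable-length row
    ),
)

rows = [(int(n), values.tolist()) for n, values in dataset.as_numpy_iterator()]
print(rows)  # [(1, [1]), (2, [2, 2]), (3, [3, 3, 3])]
```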

from_tensor_slices

Args
`tensors`	A dataset element, whose components have the same first dimension. Supported values are documented at https://www.tensorflow.org/guide/data#dataset_structure.
`name`	(Optional.) A name for the tf.data operation.

from_tensors

Args
`tensors`	A dataset "element". Supported values are documented at https://www.tensorflow.org/guide/data#dataset_structure.
`name`	(Optional.) A name for the tf.data operation.
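
The difference between `from_tensor_slices` and `from_tensors` can be sketched with the same input, assuming TensorFlow 2.x in eager mode. The `features` and `labels` values are sample data: slicing yields one element per row, while `from_tensors` yields a single element holding everything.

```python
import tensorflow as tf

features = [[1, 2], [3, 4], [5, 6]]
labels = [0, 1, 0]

# Slices along the first dimension: three (feature_row, label) elements.
sliced = tf.data.Dataset.from_tensor_slices((features, labels))
# Wraps the whole input as one element.
whole = tf.data.Dataset.from_tensors((features, labels))

n_sliced = int(sliced.cardinality().numpy())
n_whole = int(whole.cardinality().numpy())
first_x, first_y = next(iter(sliced))

print(n_sliced, n_whole)                          # 3 1
print(first_x.numpy().tolist(), int(first_y.numpy()))  # [1, 2] 0
```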

group_by_window

Args
`key_func`	A function mapping a nested structure of tensors (having shapes and types defined by `self.output_shapes` and `self.output_types`) to a scalar `tf.int64` tensor.
`reduce_func`	A function mapping a key and a dataset of up to `window_size` consecutive elements matching that key to another dataset.
`window_size`	A `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements matching the same key to combine in a single batch, which will be passed to `reduce_func`. Mutually exclusive with `window_size_func`.
`window_size_func`	A function mapping a key to a `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements matching the same key to combine in a single batch, which will be passed to `reduce_func`. Mutually exclusive with `window_size`.
`name`	(Optional.) A name for the tf.data operation.
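
A minimal sketch of `group_by_window`, assuming a TensorFlow 2.x release in which it is a `Dataset` method. Elements are keyed by parity, and each full window of five matching elements is reduced to one batch.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
window_size = 5

grouped = dataset.group_by_window(
    key_func=lambda x: x % 2,  # int64 key: 0 for even, 1 for odd
    reduce_func=lambda key, window: window.batch(window_size),
    window_size=window_size,
)

groups = [g.tolist() for g in grouped.as_numpy_iterator()]
print(groups)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```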

ignore_errors

Args
`log_warning`	(Optional.) A bool indicating whether or not ignored errors should be logged to stderr. Defaults to `False`.
`name`	(Optional.) A string indicating a name for the `tf.data` operation.
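
As a sketch of `ignore_errors`, assuming a TensorFlow 2.x release in which it is a `Dataset` method: dividing by zero produces `inf`, `tf.debugging.check_numerics` raises an error for that element, and the element is silently dropped.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 0.0, 4.0])

# 1.0 / 0.0 is inf, which check_numerics rejects with an InvalidArgumentError.
dataset = dataset.map(
    lambda x: tf.debugging.check_numerics(1.0 / x, "bad value"))

# Elements whose computation raised an error are skipped.
dataset = dataset.ignore_errors()
values = [float(v) for v in dataset.as_numpy_iterator()]
print(values)  # [1.0, 0.5, 0.25]
```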

interleave

Args
`map_func`	A function that takes a dataset element and returns a `tf.data.Dataset`.
`cycle_length`	(Optional.) The number of input elements that will be processed concurrently. If not set, the tf.data runtime decides what it should be based on available CPU. If `num_parallel_calls` is set to `tf.data.AUTOTUNE`, the `cycle_length` argument identifies the maximum degree of parallelism.
`block_length`	(Optional.) The number of consecutive elements to produce from each input element before cycling to another input element. If not set, defaults to 1.
`num_parallel_calls`	(Optional.) If specified, the implementation creates a threadpool, which is used to fetch inputs from cycle elements asynchronously and in parallel. The default behavior is to fetch inputs from cycle elements synchronously with no parallelism. If the value `tf.data.AUTOTUNE` is used, then the number of parallel calls is set dynamically based on available CPU.
`deterministic`	(Optional.) When `num_parallel_calls` is specified, if this boolean is specified (`True` or `False`), it controls the order in which the transformation produces elements. If set to `False`, the transformation is allowed to yield elements out of order to trade determinism for performance. If not specified, the `tf.data.Options.deterministic` option (`True` by default) controls the behavior.
`name`	(Optional.) A name for the tf.data operation.
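
The interaction of `cycle_length` and `block_length` is easiest to see on a small input. This sketch assumes TensorFlow 2.x in eager mode and uses the default sequential, deterministic behavior: two input elements are open at a time, and up to four consecutive outputs are taken from each before moving to the other.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1, 6)  # 1, 2, 3, 4, 5

interleaved = dataset.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(6),
    cycle_length=2,   # process two input elements concurrently
    block_length=4,   # take up to four outputs before switching
)

values = [int(v) for v in interleaved.as_numpy_iterator()]
print(values)
# [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2,
#  3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 4, 4,
#  5, 5, 5, 5, 5, 5]
```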

list_files

Args
`file_pattern`	A string, a list of strings, or a `tf.Tensor` of string type (scalar or vector), representing the filename glob (i.e. shell wildcard) pattern(s) that will be matched.
`shuffle`	(Optional.) If `True`, the file names will be shuffled randomly. Defaults to `True`.
`seed`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the random seed that will be used to create the distribution. See `tf.random.set_seed` for behavior.
`name`	(Optional.) A name for the tf.data operations used by `list_files`.
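
A small sketch of glob matching with `list_files`, assuming TensorFlow 2.x in eager mode. The temporary directory and the file names `a.txt`, `b.txt`, and `c.csv` are hypothetical and created only for this example.

```python
import os
import tempfile

import tensorflow as tf

# Create a scratch directory with two .txt files and one .csv file.
tmp_dir = tempfile.mkdtemp()
for name in ["a.txt", "b.txt", "c.csv"]:
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write("hello\n")

# shuffle=False avoids the default random shuffling of file names.
dataset = tf.data.Dataset.list_files(
    os.path.join(tmp_dir, "*.txt"), shuffle=False)

# Elements are byte strings holding the matched paths.
names = sorted(
    os.path.basename(p.decode("utf-8")) for p in dataset.as_numpy_iterator())
print(names)  # ['a.txt', 'b.txt']
```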

load

Args
`path`	Required. A path pointing to a previously saved dataset.
`element_spec`	Optional. A nested structure of `tf.TypeSpec` objects matching the structure of an element of the saved dataset and specifying the type of individual element components. If not provided, the nested structure of `tf.TypeSpec` saved with the saved dataset is used. Note that this argument is required in graph mode.
`compression`	Optional. The algorithm to use to decompress the data when reading it. Supported options are `GZIP` and `NONE`. Defaults to `NONE`.
`reader_func`	Optional. A function to control how to read data from shards. If present, the function will be traced and executed as graph computation.

Raises
`FileNotFoundError`	If `element_spec` is not specified and the saved nested structure of `tf.TypeSpec` cannot be located with the saved dataset.
`ValueError`	If `element_spec` is not specified and the method is executed in graph mode.

padded_batch

Raises
`ValueError`	If a component has an unknown rank, and the `padded_shapes` argument is not set.
`TypeError`	If a component is of an unsupported type. The list of supported types is documented at https://www.tensorflow.org/guide/data#dataset_structure.
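
A minimal sketch of `padded_batch` on variable-length elements, assuming TensorFlow 2.x in eager mode. The first call relies on the default padding (to the longest element in each batch, with 0); the second pads every element to length 5 with -1.

```python
import tensorflow as tf

# Elements: [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]
ragged = tf.data.Dataset.range(1, 5, output_type=tf.int32).map(
    lambda x: tf.fill([x], x))

# Pad to the longest element within each batch, using the default value 0.
default_padded = [
    b.tolist() for b in ragged.padded_batch(2).as_numpy_iterator()]

# Pad to a fixed length of 5, using -1 as the padding value.
fixed_padded = [
    b.tolist()
    for b in ragged.padded_batch(
        2, padded_shapes=5, padding_values=-1).as_numpy_iterator()
]

print(default_padded)  # [[[1, 0], [2, 2]], [[3, 3, 3, 0], [4, 4, 4, 4]]]
print(fixed_padded)
```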

prefetch

Args
`buffer_size`	A `tf.int64` scalar `tf.Tensor`, representing the maximum number of elements that will be buffered when prefetching. If the value `tf.data.AUTOTUNE` is used, then the buffer size is dynamically tuned.
`name`	Optional. A name for the tf.data transformation.
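
As a sketch, `prefetch` is typically placed at the end of a pipeline so that upstream work overlaps with consumption. The example assumes TensorFlow 2.x in eager mode; prefetching changes timing only, not the elements or their order.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)
dataset = dataset.map(lambda x: x + 1, num_parallel_calls=tf.data.AUTOTUNE)

# Let the runtime tune how many elements are buffered ahead of the consumer.
dataset = dataset.prefetch(tf.data.AUTOTUNE)

values = [int(v) for v in dataset.as_numpy_iterator()]
print(values)  # [1, 2, 3, 4, 5]
```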

random

Args
`seed`	(Optional.) If specified, the dataset produces a deterministic sequence of values.
`rerandomize_each_iteration`	(Optional.) If set to `False`, the dataset generates the same sequence of random numbers for each epoch. If set to `True`, it generates a different deterministic sequence of random numbers for each epoch. Defaults to `False`.
`name`	(Optional.) A name for the tf.data operation.
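
A short sketch of the effect of `seed`, assuming a TensorFlow 2.x release that provides `tf.data.Dataset.random`. The seed value 4 is arbitrary; two datasets built with the same seed produce the same sequence.

```python
import tensorflow as tf

ds1 = tf.data.Dataset.random(seed=4).take(5)
ds2 = tf.data.Dataset.random(seed=4).take(5)

first = [int(v) for v in ds1.as_numpy_iterator()]
second = [int(v) for v in ds2.as_numpy_iterator()]

print(first == second)  # True
```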

rebatch

Args
`batch_size`	A `tf.int64` scalar or vector, representing the size of batches to produce. If this argument is a vector, these values are cycled through in round robin fashion.
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last batch should be dropped in the case it has fewer than `batch_size[cycle_index]` elements; the default behavior is not to drop the smaller batch.
`name`	(Optional.) A name for the tf.data operation.
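
A minimal sketch of `rebatch` with a scalar `batch_size`, assuming a recent TensorFlow 2.x release in which `rebatch` is a `Dataset` method. Batches of four are regrouped into batches of three, with the short final batch kept by default.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(7).batch(4)  # [0, 1, 2, 3], [4, 5, 6]

# Regroup the same elements into batches of 3.
rebatched = dataset.rebatch(3)
batches = [b.tolist() for b in rebatched.as_numpy_iterator()]
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```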

reduce

Args
`initial_state`	An element representing the initial state of the transformation.
`reduce_func`	A function that maps `(old_state, input_element)` to `new_state`. It must take two arguments and return a new element. The structure of `new_state` must match the structure of `initial_state`.
`name`	(Optional.) A name for the tf.data operation.
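
As a sketch of `reduce`, assuming TensorFlow 2.x in eager mode: the same dataset is folded twice, once to count its elements and once to sum them. The initial state is an `int64` scalar so that it matches the dtype of `Dataset.range`.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4
zero = tf.constant(0, dtype=tf.int64)

# Count elements: ignore the element and add 1 to the state.
count = int(dataset.reduce(zero, lambda state, _: state + 1).numpy())
# Sum elements: add each element to the state.
total = int(dataset.reduce(zero, lambda state, x: state + x).numpy())

print(count, total)  # 5 10
```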

rejection_resample

Args
`class_func`	A function mapping an element of the input dataset to a scalar `tf.int32` tensor. Values should be in `[0, num_classes)`.
`target_dist`	A floating point type tensor, shaped `[num_classes]`.
`initial_dist`	(Optional.) A floating point type tensor, shaped `[num_classes]`. If not provided, the true class distribution is estimated live in a streaming fashion.
`seed`	(Optional.) Python integer seed for the resampler.
`name`	(Optional.) A name for the tf.data operation.

repeat

Args
`count`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the number of times the dataset should be repeated. The default behavior (if `count` is `None` or `-1`) is for the dataset to be repeated indefinitely.
`name`	(Optional.) A name for the tf.data operation.
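
A brief sketch of finite and indefinite repetition, assuming TensorFlow 2.x in eager mode. Because `repeat()` without a `count` never ends, the example bounds it with `take`.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# Repeat a fixed number of times.
twice = [int(v) for v in dataset.repeat(2).as_numpy_iterator()]

# Repeat indefinitely, then cut the stream off after 7 elements.
bounded = [int(v) for v in dataset.repeat().take(7).as_numpy_iterator()]

print(twice)    # [1, 2, 3, 1, 2, 3]
print(bounded)  # [1, 2, 3, 1, 2, 3, 1]
```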

sample_from_datasets

Args
`datasets`	A non-empty list of `tf.data.Dataset` objects with compatible structure.
`weights`	(Optional.) A list or Tensor of `len(datasets)` floating-point values where `weights[i]` represents the probability to sample from `datasets[i]`, or a `tf.data.Dataset` object where each element is such a list. Defaults to a uniform distribution across `datasets`.
`seed`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the random seed that will be used to create the distribution. See `tf.random.set_seed` for behavior.
`stop_on_empty_dataset`	If `True`, sampling stops if it encounters an empty dataset. If `False`, it continues sampling and skips any empty datasets. It is recommended to set it to `True`. Otherwise, the distribution of samples starts off as the user intends, but may change as input datasets become empty. This can be difficult to detect since the dataset starts off looking correct. Defaults to `False` for backward compatibility.
`rerandomize_each_iteration`	An optional `bool`. The boolean argument controls whether the sequence of random numbers used to determine which dataset to sample from will be rerandomized each epoch. That is, it determines whether datasets will be sampled in the same order across different epochs (the default behavior) or not.
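
The sketch below samples from two finite datasets, assuming a TensorFlow 2.x release that exposes `sample_from_datasets` as a static method. The weights and seed are arbitrary. The sampling order is random, so the example only checks which elements appear: with the default `stop_on_empty_dataset=False`, sampling continues until every input is exhausted.

```python
import tensorflow as tf

low = tf.data.Dataset.range(0, 3)       # 0, 1, 2
high = tf.data.Dataset.range(100, 103)  # 100, 101, 102

mixed = tf.data.Dataset.sample_from_datasets(
    [low, high], weights=[0.5, 0.5], seed=42)

sampled = [int(v) for v in mixed.as_numpy_iterator()]
print(sorted(sampled))  # [0, 1, 2, 100, 101, 102]
```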

save

Args
`path`	Required. A directory to use for saving the dataset.
`compression`	Optional. The algorithm to use to compress data when writing it. Supported options are `GZIP` and `NONE`. Defaults to `NONE`.
`shard_func`	Optional. A function to control the mapping of dataset elements to file shards. The function is expected to map elements of the input dataset to int64 shard IDs. If present, the function will be traced and executed as graph computation.
`checkpoint_args`	Optional args for checkpointing which will be passed into the `tf.train.CheckpointManager`. If `checkpoint_args` are not specified, then checkpointing will not be performed. The `save()` implementation creates a `tf.train.Checkpoint` object internally, so users should not set the `checkpoint` argument in `checkpoint_args`.
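
A minimal round trip through `save` and `load`, assuming a recent TensorFlow 2.x release in eager mode where both are available on `tf.data.Dataset`. The directory name `saved_data` is hypothetical; default compression and sharding are used.

```python
import os
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "saved_data")

dataset = tf.data.Dataset.range(5)
dataset.save(path)  # writes the elements and their element_spec

# In eager mode, element_spec is read back from the saved dataset.
restored = tf.data.Dataset.load(path)
values = [int(v) for v in restored.as_numpy_iterator()]
print(values)  # [0, 1, 2, 3, 4]
```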

scan

Args
`initial_state`	A nested structure of tensors, representing the initial state of the accumulator.
`scan_func`	A function that maps `(old_state, input_element)` to `(new_state, output_element)`. It must take two arguments and return a pair of nested structures of tensors. The `new_state` must match the structure of `initial_state`.
`name`	(Optional.) A name for the tf.data operation.
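
As a sketch of `scan`, assuming a TensorFlow 2.x release in which it is a `Dataset` method: the state carries a running total, and the same value is emitted as the output element, giving a cumulative sum.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4
initial_state = tf.constant(0, dtype=tf.int64)

# scan_func returns (new_state, output_element).
running = dataset.scan(
    initial_state=initial_state,
    scan_func=lambda state, x: (state + x, state + x),
)

sums = [int(v) for v in running.as_numpy_iterator()]
print(sums)  # [0, 1, 3, 6, 10]
```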

shard

Args
`num_shards`	A `tf.int64` scalar `tf.Tensor`, representing the number of shards operating in parallel.
`index`	A `tf.int64` scalar `tf.Tensor`, representing the worker index.
`name`	(Optional.) A name for the tf.data operation.
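
A short sketch of `shard`, assuming TensorFlow 2.x in eager mode. With three shards, the worker with a given `index` receives every third element starting at that index.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Collect the elements seen by each of three workers.
shards = [
    [int(v) for v in dataset.shard(num_shards=3, index=i).as_numpy_iterator()]
    for i in range(3)
]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```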

snapshot

Args
`path`	Required. A directory to use for storing / loading the snapshot to / from.
`compression`	Optional. The type of compression to apply to the snapshot written to disk. Supported options are `GZIP`, `SNAPPY`, `AUTO` or None. Defaults to `AUTO`, which attempts to pick an appropriate compression algorithm for the dataset.
`reader_func`	Optional. A function to control how to read data from snapshot shards.
`shard_func`	Optional. A function to control how to shard data when writing a snapshot.
`name`	(Optional.) A name for the tf.data operation.