tf.data.Dataset

Represents a potentially large set of elements.

tf.data.Dataset(
    variant_tensor
)

The tf.data.Dataset API supports writing descriptive and efficient input pipelines. Dataset usage follows a common pattern:

1. Create a source dataset from your input data.
2. Apply dataset transformations to preprocess the data.
3. Iterate over the dataset and process the elements.

Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
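As a rough illustration of the three steps above, the following sketch builds a tiny pipeline end to end. The particular transformations (`filter`, `map`, `batch`) and the input range are arbitrary choices for demonstration, not a recommended recipe.

```python
import tensorflow as tf

# Step 1: create a source dataset (here, the integers 0..9).
dataset = tf.data.Dataset.range(10)

# Step 2: apply transformations: keep even values, square them, batch in pairs.
dataset = (
    dataset.filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
    .batch(2)
)

# Step 3: iterate; elements are produced one batch at a time.
batches = [batch.tolist() for batch in dataset.as_numpy_iterator()]
print(batches)  # [[0, 4], [16, 36], [64]]
```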

Source Datasets:

The simplest way to create a dataset is to create it from a Python list:

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)

To process lines from files, use tf.data.TextLineDataset:

dataset = tf.data.TextLineDataset(["file1.txt", "file2.txt"])

To process records written in the TFRecord format, use tf.data.TFRecordDataset:

dataset = tf.data.TFRecordDataset(["file1.tfrecords", "file2.tfrecords"])

To create a dataset of all files matching a pattern, use tf.data.Dataset.list_files:

dataset = tf.data.Dataset.list_files("/path/*.txt")

See tf.data.FixedLengthRecordDataset and tf.data.Dataset.from_generator for more ways to create datasets.

Transformations:

Once you have a dataset, you can apply transformations to prepare the data for your model:

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.map(lambda x: x*2)
list(dataset.as_numpy_iterator())  # [2, 4, 6]

Common Terms:

Element: A single output from calling next() on a dataset iterator. Elements may be nested structures containing multiple components. For example, the element (1, (3, "apple")) has one tuple nested in another tuple. The components are 1, 3, and "apple".

Component: The leaf in the nested structure of an element.

Supported types:

Elements can be nested structures of tuples, named tuples, and dictionaries. Note that Python lists are not treated as nested structures of components. Instead, lists are converted to tensors and treated as components. For example, the element (1, [1, 2, 3]) has only two components: the tensor 1 and the tensor [1, 2, 3]. Element components can be of any type representable by tf.TypeSpec, including tf.Tensor, tf.data.Dataset, tf.sparse.SparseTensor, tf.RaggedTensor, and tf.TensorArray.

import collections
import tensorflow as tf

a = 1 # Integer element
b = 2.0 # Float element
c = (1, 2) # Tuple element with 2 components
d = {"a": (2, 2), "b": 3} # Dict element with 3 components
Point = collections.namedtuple("Point", ["x", "y"])
e = Point(1, 2) # Named tuple
f = tf.data.Dataset.range(10) # Dataset element

For more information, read this guide.
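One way to check how many components an element has is to flatten the dataset's `element_spec`. The sketch below assumes `tf.nest.flatten` as the counting helper (it is not mentioned in the text above) and uses `from_tensors` only to wrap a single element.

```python
import tensorflow as tf

# A list inside a tuple is converted to one tensor, so it counts as one component.
ds = tf.data.Dataset.from_tensors((1, [1, 2, 3]))
spec = ds.element_spec
num_components = len(tf.nest.flatten(spec))
print(num_components)  # 2
print(spec[1].shape)   # (3,)

# A dict holding a tuple has one component per leaf: "a"[0], "a"[1], and "b".
nested = tf.data.Dataset.from_tensors({"a": (2, 2), "b": 3})
nested_components = len(tf.nest.flatten(nested.element_spec))
print(nested_components)  # 3
```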

Args
`variant_tensor`	A DT_VARIANT tensor that represents the dataset.

Attributes

element_spec

The type specification of an element of this dataset.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset.element_spec  # TensorSpec(shape=(), dtype=tf.int32, name=None)

For more information, read this guide.

Raises
`TypeError`	if an element contains a non-`Tensor` value.
`RuntimeError`	if eager execution is not enabled.

Args
`batch_size`	A `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements of this dataset to combine in a single batch.
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last batch should be dropped in the case it has fewer than `batch_size` elements; the default behavior is not to drop the smaller batch.
`num_parallel_calls`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the number of batches to compute asynchronously in parallel. If not specified, batches will be computed sequentially. If the value `tf.data.AUTOTUNE` is used, then the number of parallel calls is set dynamically based on available resources.
`deterministic`	(Optional.) When `num_parallel_calls` is specified, if this boolean is specified (`True` or `False`), it controls the order in which the transformation produces elements. If set to `False`, the transformation is allowed to yield elements out of order to trade determinism for performance. If not specified, the `tf.data.Options.experimental_deterministic` option (`True` by default) controls the behavior.
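The arguments above match the `batch` method. A minimal sketch of how `drop_remainder` changes the result, with an arbitrary 8-element input chosen for illustration:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(8)

# Default: the final, smaller batch is kept.
kept = [b.tolist() for b in ds.batch(3).as_numpy_iterator()]
print(kept)  # [[0, 1, 2], [3, 4, 5], [6, 7]]

# drop_remainder=True discards the final batch because it has fewer than 3 elements.
dropped = [b.tolist() for b in ds.batch(3, drop_remainder=True).as_numpy_iterator()]
print(dropped)  # [[0, 1, 2], [3, 4, 5]]

# Batches may be computed in parallel; deterministic=True keeps the output order.
parallel_ds = ds.batch(3, num_parallel_calls=tf.data.AUTOTUNE, deterministic=True)
parallel = [b.tolist() for b in parallel_ds.as_numpy_iterator()]
```

Dropping the remainder is mostly useful when downstream code needs a fully static batch dimension.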

Args
`element_length_func`	A function mapping an element of the `Dataset` to a `tf.int32` length, which determines the bucket the element goes into.
`bucket_boundaries`	`list<int>`, upper length boundaries of the buckets.
`bucket_batch_sizes`	`list<int>`, batch size per bucket. Length should be `len(bucket_boundaries) + 1`.
`padded_shapes`	Nested structure of `tf.TensorShape` to pass to `tf.data.Dataset.padded_batch`. If not provided, will use `dataset.output_shapes`, which will result in variable length dimensions being padded out to the maximum length in each batch.
`padding_values`	Values to pad with, passed to `tf.data.Dataset.padded_batch`. Defaults to padding with 0.
`pad_to_bucket_boundary`	`bool`. If `False`, pads dimensions of unknown size to the maximum length in the batch. If `True`, pads dimensions of unknown size to the bucket boundary minus 1 (i.e., the maximum length in each bucket); the caller must ensure that the source `Dataset` contains no elements longer than `max(bucket_boundaries)`.
`no_padding`	`bool`, indicates whether to pad the batch features (features need to be either of type `tf.sparse.SparseTensor` or of same shape).
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last batch should be dropped in the case it has fewer than `batch_size` elements; the default behavior is not to drop the smaller batch.
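The arguments above match the `bucket_by_sequence_length` method. The sketch below groups variable-length sequences into three buckets (length below 3, length 3 or 4, length 5 and above) and pads each batch to its longest member. The sample sequences and the use of `from_generator` as the source are assumptions made for illustration.

```python
import tensorflow as tf

sequences = [
    [0], [1, 2, 3, 4], [5, 6, 7],
    [7, 8, 9, 10, 11], [13, 14, 15, 16, 19, 20], [21, 22],
]

# Variable-length elements need an unknown dimension in the signature.
ds = tf.data.Dataset.from_generator(
    lambda: sequences,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int64),
)

bucketed = ds.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[3, 5],      # two boundaries define three buckets
    bucket_batch_sizes=[2, 2, 2],  # len(bucket_boundaries) + 1 entries
)

batches = [b.tolist() for b in bucketed.as_numpy_iterator()]
for batch in batches:
    print(batch)
```

Each printed batch contains two sequences from the same bucket, zero-padded to the longer of the two.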

Args
`generator`	A callable object that returns an object that supports the `iter()` protocol. If `args` is not specified, `generator` must take no arguments; otherwise it must take as many arguments as there are values in `args`.
`output_types`	(Optional.) A (nested) structure of `tf.DType` objects corresponding to each component of an element yielded by `generator`.
`output_shapes`	(Optional.) A (nested) structure of `tf.TensorShape` objects corresponding to each component of an element yielded by `generator`.
`args`	(Optional.) A tuple of `tf.Tensor` objects that will be evaluated and passed to `generator` as NumPy-array arguments.
`output_signature`	(Optional.) A (nested) structure of `tf.TypeSpec` objects corresponding to each component of an element yielded by `generator`.
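The arguments above match `from_generator`. A small sketch using `output_signature` together with `args`; the generator `gen` and its `stop` parameter are hypothetical names introduced for this example.

```python
import tensorflow as tf

def gen(stop):
    # `stop` arrives as a NumPy value because it is passed through `args`.
    for i in range(1, stop):
        yield i, [1] * i

ds = tf.data.Dataset.from_generator(
    gen,
    args=(4,),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),       # scalar counter
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # variable-length vector
    ),
)

items = [(int(i), v.tolist()) for i, v in ds.as_numpy_iterator()]
print(items)  # [(1, [1]), (2, [1, 1]), (3, [1, 1, 1])]
```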

Args
`key_func`	A function mapping a nested structure of tensors (having shapes and types defined by `self.output_shapes` and `self.output_types`) to a scalar `tf.int64` tensor.
`reduce_func`	A function mapping a key and a dataset of up to `window_size` consecutive elements matching that key to another dataset.
`window_size`	A `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements matching the same key to combine in a single batch, which will be passed to `reduce_func`. Mutually exclusive with `window_size_func`.
`window_size_func`	A function mapping a key to a `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements matching the same key to combine in a single batch, which will be passed to `reduce_func`. Mutually exclusive with `window_size`.
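The arguments above match `group_by_window`. In this sketch the key is the parity of each element, and each full window of five same-key elements is reduced to a single batch. The input range and window size are illustrative choices.

```python
import tensorflow as tf

window_size = 5
ds = tf.data.Dataset.range(10)

grouped = ds.group_by_window(
    key_func=lambda x: x % 2,                           # int64 key: 0 (even) or 1 (odd)
    reduce_func=lambda key, window: window.batch(window_size),
    window_size=window_size,
)

groups = [g.tolist() for g in grouped.as_numpy_iterator()]
print(groups)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```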

Args
`map_func`	A function mapping a dataset element to a dataset.
`cycle_length`	(Optional.) The number of input elements that will be processed concurrently. If not set, the tf.data runtime decides what it should be based on available CPU. If `num_parallel_calls` is set to `tf.data.AUTOTUNE`, the `cycle_length` argument identifies the maximum degree of parallelism.
`block_length`	(Optional.) The number of consecutive elements to produce from each input element before cycling to another input element. If not set, defaults to 1.
`num_parallel_calls`	(Optional.) If specified, the implementation creates a threadpool, which is used to fetch inputs from cycle elements asynchronously and in parallel. The default behavior is to fetch inputs from cycle elements synchronously with no parallelism. If the value `tf.data.AUTOTUNE` is used, then the number of parallel calls is set dynamically based on available CPU.
`deterministic`	(Optional.) When `num_parallel_calls` is specified, if this boolean is specified (`True` or `False`), it controls the order in which the transformation produces elements. If set to `False`, the transformation is allowed to yield elements out of order to trade determinism for performance. If not specified, the `tf.data.Options.experimental_deterministic` option (`True` by default) controls the behavior.
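The arguments above match `interleave`. The following sketch shows how `cycle_length` and `block_length` shape the output order, and how `deterministic=False` relaxes that order when parallel calls are enabled. The helper `repeat_six` is a hypothetical name used only here.

```python
import tensorflow as tf

def repeat_six(x):
    # Map each input element to a dataset that repeats it six times.
    return tf.data.Dataset.from_tensors(x).repeat(6)

ds = tf.data.Dataset.range(1, 6)  # 1, 2, 3, 4, 5

# Two inputs are processed at a time, emitting up to four elements from each in turn.
ordered = list(
    ds.interleave(repeat_six, cycle_length=2, block_length=4).as_numpy_iterator()
)
print([int(x) for x in ordered])
# [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5]

# Parallel, non-deterministic variant: same elements, order not guaranteed.
unordered = list(
    ds.interleave(
        repeat_six,
        cycle_length=2,
        block_length=4,
        num_parallel_calls=tf.data.AUTOTUNE,
        deterministic=False,
    ).as_numpy_iterator()
)
```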

Args
`file_pattern`	A string, a list of strings, or a `tf.Tensor` of string type (scalar or vector), representing the filename glob (i.e. shell wildcard) pattern(s) that will be matched.
`shuffle`	(Optional.) If `True`, the file names will be shuffled randomly. Defaults to `True`.
`seed`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the random seed that will be used to create the distribution. See `tf.random.set_seed` for behavior.
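The arguments above match `list_files`. The sketch below creates a few empty files in a temporary directory (the file names are made up for the example) and lists only those matching a glob pattern, with shuffling turned off.

```python
import os
import tempfile

import tensorflow as tf

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create placeholder files; only the .txt files should match the pattern.
    for name in ["a.txt", "b.txt", "c.csv"]:
        open(os.path.join(tmp_dir, name), "w").close()

    ds = tf.data.Dataset.list_files(os.path.join(tmp_dir, "*.txt"), shuffle=False)

    # Elements are byte strings holding the matched paths.
    matched = sorted(os.path.basename(p.decode()) for p in ds.as_numpy_iterator())

print(matched)  # ['a.txt', 'b.txt']
```

Because `shuffle` defaults to `True`, passing `shuffle=False` (or a fixed `seed`) is worth considering when a reproducible file order matters.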

Args
`initial_state`	An element representing the initial state of the transformation.
`reduce_func`	A function that maps `(old_state, input_element)` to `new_state`. It must take two arguments and return a new element. The structure of `new_state` must match the structure of `initial_state`.
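The arguments above match `reduce`. A short sketch that folds a dataset into a single value, once as a sum and once as a count; the initial state is given as an explicit `tf.int64` constant so its type matches the elements of `range`.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4
zero = tf.constant(0, dtype=tf.int64)

# Sum: the new state is the old state plus the current element.
total = int(ds.reduce(zero, lambda state, x: state + x).numpy())
print(total)  # 10

# Count: the element is ignored and the state is incremented.
count = int(ds.reduce(zero, lambda state, _: state + 1).numpy())
print(count)  # 5
```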

Args
`initial_state`	A nested structure of tensors, representing the initial state of the accumulator.
`scan_func`	A function that maps `(old_state, input_element)` to `(new_state, output_element)`. It must take two arguments and return a pair of nested structures of tensors. The `new_state` must match the structure of `initial_state`.
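The arguments above match `scan`. Unlike `reduce`, `scan` emits one output per input element while carrying state forward. This sketch computes a running sum over an arbitrary small range.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4

running = ds.scan(
    initial_state=tf.constant(0, dtype=tf.int64),
    # Return (new_state, output_element); here both are the running total.
    scan_func=lambda state, x: (state + x, state + x),
)

sums = [int(v) for v in running.as_numpy_iterator()]
print(sums)  # [0, 1, 3, 6, 10]
```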

Args
`num_shards`	A `tf.int64` scalar `tf.Tensor`, representing the number of shards operating in parallel.
`index`	A `tf.int64` scalar `tf.Tensor`, representing the worker index.
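The arguments above match `shard`. The sketch below splits one dataset across three hypothetical workers; each worker keeps every third element, starting at its own index.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Build one shard per worker index.
shards = [
    [int(x) for x in ds.shard(num_shards=3, index=i).as_numpy_iterator()]
    for i in range(3)
]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```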

Args
`buffer_size`	A `tf.int64` scalar `tf.Tensor`, representing the number of elements from this dataset from which the new dataset will sample.
`seed`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the random seed that will be used to create the distribution. See `tf.random.set_seed` for behavior.
`reshuffle_each_iteration`	(Optional.) A boolean, which if true indicates that the dataset should be pseudorandomly reshuffled each time it is iterated over. (Defaults to `True`.)
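The arguments above match `shuffle`. This sketch uses a buffer as large as the dataset and disables `reshuffle_each_iteration`, so repeated passes produce the same order. The seed value is arbitrary, and the exact permutation is intentionally not shown because it depends on the random number generator.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10).shuffle(
    buffer_size=10,
    seed=42,
    reshuffle_each_iteration=False,
)

first_pass = [int(x) for x in ds.as_numpy_iterator()]
second_pass = [int(x) for x in ds.as_numpy_iterator()]

print(first_pass == second_pass)  # True: the order is reused across iterations
print(sorted(first_pass))         # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```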

Args
`path`	Required. A directory to use for storing / loading the snapshot to / from.
`compression`	Optional. The type of compression to apply to the snapshot written to disk. Supported options are `GZIP`, `SNAPPY`, `AUTO` or None. Defaults to `AUTO`, which attempts to pick an appropriate compression algorithm for the dataset.
`reader_func`	Optional. A function to control how to read data from snapshot shards.
`shard_func`	Optional. A function to control how to shard data when writing a snapshot.

Args
`size`	A `tf.int64` scalar `tf.Tensor`, representing the number of elements of the input dataset to combine into a window. Must be positive.
`shift`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the number of input elements by which the window moves in each iteration. Defaults to `size`. Must be positive.
`stride`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the stride of the input elements in the sliding window. Must be positive. The default value of 1 means "retain every input element".
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last windows should be dropped if their size is smaller than `size`.
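The arguments above match `window`. Each element of a windowed dataset is itself a dataset, so the sketch below materializes every window as a list. The helper `as_lists` is a hypothetical name introduced for this example.

```python
import tensorflow as tf

def as_lists(windowed):
    # Convert a dataset of window datasets into a list of Python lists.
    return [[int(x) for x in w.as_numpy_iterator()] for w in windowed]

ds = tf.data.Dataset.range(7)

# shift defaults to size, so windows do not overlap.
default = as_lists(ds.window(3))
print(default)  # [[0, 1, 2], [3, 4, 5], [6]]

# shift=1 slides the window by one element; short trailing windows are dropped.
sliding = as_lists(ds.window(3, shift=1, drop_remainder=True))
print(sliding)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

# stride=2 takes every second element within each window.
strided = as_lists(ds.window(3, shift=1, stride=2, drop_remainder=True))
print(strided)  # [[0, 2, 4], [1, 3, 5], [2, 4, 6]]
```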

Methods

apply

as_numpy_iterator

batch

bucket_by_sequence_length

cache

cardinality

concatenate

enumerate

filter

flat_map

from_generator

from_tensor_slices

from_tensors

get_single_element

group_by_window

interleave

list_files

map

options

padded_batch

prefetch

random

range

reduce

repeat

scan

shard

shuffle

skip

snapshot

take

take_while

unbatch

unique

window

with_options

zip

__bool__

__iter__

__len__

__nonzero__
