tf.data.experimental.CsvDataset

A Dataset comprising lines from one or more CSV files.

Inherits From: Dataset

The tf.data.experimental.CsvDataset class provides a minimal CSV Dataset interface. There is also a richer tf.data.experimental.make_csv_dataset function which provides additional convenience features such as column header parsing, column type-inference, automatic shuffling, and file interleaving.
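For comparison, the following is a minimal, illustrative sketch of the richer helper. The file path is hypothetical (a CSV file that starts with a header row), and the batch size and epoch settings are chosen only for the example; they are not defaults of make_csv_dataset:

import tensorflow as tf

# make_csv_dataset reads the column names from the header row, infers column
# types, and returns batches of records as dictionaries keyed by column name.
batched_dataset = tf.data.experimental.make_csv_dataset(
    "/tmp/my_file_with_header.csv",  # hypothetical file with a header row
    batch_size=2,    # records are batched automatically
    num_epochs=1,    # read the file once
    shuffle=False,   # disable the automatic shuffling for a deterministic order
)

Each element of batched_dataset is a dictionary mapping column names (taken from the header) to batched tensors.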

The elements of this dataset correspond to records from the file(s). RFC 4180 format is expected for CSV files (https://tools.ietf.org/html/rfc4180). Note that leading and trailing spaces are allowed for int or float fields.

For example, suppose we have a file 'my_file0.csv' with four CSV columns of different data types:

with open('/tmp/my_file0.csv', 'w') as f:
  f.write('abcdefg,4.28E10,5.55E6,12\n')
  f.write('hijklmn,-5.3E14,,2\n')

We can construct a CsvDataset from it as follows:

dataset = tf.data.experimental.CsvDataset(
  "/tmp/my_file0.csv",
  [tf.float32,  # Required field, use dtype or empty tensor
   tf.constant([0.0], dtype=tf.float32),  # Optional field, default to 0.0
   tf.int32,  # Required field, use dtype or empty tensor
  ],
  select_cols=[1,2,3]  # Only parse last three columns
)

The expected output of its iterations is:

for element in dataset.as_numpy_iterator():
  print(element)
(4.28e10, 5.55e6, 12)
(-5.3e14, 0.0, 2)
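
Because CsvDataset inherits from tf.data.Dataset, the usual Dataset transformations are available on it as well. As a small sketch, reusing the dataset constructed above, the parsed records can be batched before iteration:

# Group the parsed records into batches of two.
batched = dataset.batch(2)
for floats, opt_floats, ints in batched.as_numpy_iterator():
  print(floats, opt_floats, ints)

This is expected to print a single batch containing both records, roughly [4.28e+10 -5.3e+14] [5550000. 0.] [12 2].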

See https://www.tensorflow.org/tutorials/load_data/csv#tfdataexperimentalcsvdataset for more in-depth example usage.

Args

filenames: A tf.string tensor containing one or more filenames.
record_defaults: A list of default values for the CSV fields. Each item in the list is either a valid CSV dtype (float32, float64, int32, int64, string) or a Tensor object with one of those types. There is one item per selected column of CSV data: a scalar Tensor holding the default value if the column is optional, or a dtype or empty Tensor if it is required. If both this and select_cols are specified, they must have the same length, and record_defaults is assumed to be sorted in order of increasing column index. If both this and exclude_cols are specified, the lengths of record_defaults and exclude_cols should sum to the total number of columns in the CSV file. Several of these arguments are shown together in the sketch after this list.
compression_type: (Optional.) A tf.string scalar evaluating to one of "" (no compression), "ZLIB", or "GZIP". Defaults to no compression.
buffer_size: (Optional.) A tf.int64 scalar denoting the number of bytes to buffer while reading files. Defaults to 4MB.
header: (Optional.) A tf.bool scalar indicating whether the CSV file(s) have header line(s) that should be skipped when parsing. Defaults to False.
field_delim: (Optional.) A tf.string scalar containing the delimiter character that separates fields in a record. Defaults to ",".
use_quote_delim: (Optional.) A tf.bool scalar. If False, treats double quotation marks as regular characters inside of string fields (ignoring RFC 4180, Section 2, Bullet 5). Defaults to True.
na_value: (Optional.) A tf.string scalar indicating a value that will be treated as NA/NaN.
select_cols: (Optional.) A sorted list of column indices to select from the input data. If specified, only this subset of columns will be parsed. Defaults to parsing all columns.
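
As a brief sketch of several of the optional arguments used together, suppose we write a small ';'-delimited file with a header row and an 'NA' marker for a missing count. The file, its contents, and the 'NA' convention are assumptions made only for this example:

import tensorflow as tf

with open('/tmp/my_file1.csv', 'w') as f:
  f.write('name;score;count\n')
  f.write('abc; 1.5 ;NA\n')   # leading/trailing spaces are tolerated for numeric fields
  f.write('def;2.25;7\n')

dataset = tf.data.experimental.CsvDataset(
    '/tmp/my_file1.csv',
    record_defaults=[tf.string,                          # required string column
                     tf.float32,                         # required float column
                     tf.constant([0], dtype=tf.int32)],  # optional int column, default 0
    header=True,      # skip the 'name;score;count' header line
    field_delim=';',  # fields are separated by ';' rather than ','
    na_value='NA',    # fields equal to 'NA' fall back to the column default
)
for element in dataset.as_numpy_iterator():
  print(element)

Under these assumptions, this is expected to print roughly (b'abc', 1.5, 0) and (b'def', 2.25, 7).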