A tf.string tensor containing one or more filenames.
record_defaults
A list of default values for the CSV fields. Each item in
the list is either a valid CSV DType (float32, float64, int32, int64,
string), or a Tensor object with one of the above types. Provide one
per parsed column of CSV data: either a scalar Tensor default value if
the column is optional, or a DType or empty Tensor if it is required.
If both this and select_cols are specified, they must have the same
length, and record_defaults is assumed to be sorted in order of
increasing column index. If both this and exclude_cols are specified,
the sum of the lengths of record_defaults and exclude_cols should equal
the total number of columns in the CSV file. See the usage sketch
after this argument list.
compression_type
(Optional.) A tf.string scalar evaluating to one of
"" (no compression), "ZLIB", or "GZIP". Defaults to no
compression.
buffer_size
(Optional.) A tf.int64 scalar denoting the number of bytes
to buffer while reading files. Defaults to 4MB.
header
(Optional.) A tf.bool scalar indicating whether the CSV file(s)
have header line(s) that should be skipped when parsing. Defaults to
False.
field_delim
(Optional.) A tf.string scalar containing the delimiter
character that separates fields in a record. Defaults to ",".
use_quote_delim
(Optional.) A tf.bool scalar. If False, treats
double quotation marks as regular characters inside of string fields
(ignoring RFC 4180, Section 2, Bullet 5). Defaults to True.
na_value
(Optional.) A tf.string scalar indicating a value that will
be treated as NA/NaN.
select_cols
(Optional.) A sorted list of column indices to select from
the input data. If specified, only this subset of columns will be
parsed. Defaults to parsing all columns. At most one of select_cols
and exclude_cols can be specified.
exclude_cols
(Optional.) A sorted list of column indices to exclude from
the input data. If specified, only the complement of this set of columns
will be parsed. Defaults to parsing all columns. At most one of
select_cols and exclude_cols can be specified.
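A minimal usage sketch, assuming the constructor documented here is
tf.data.experimental.CsvDataset and that "scores.csv" is a hypothetical
input file whose first column holds an int and whose third column holds
a float:
import tensorflow as tf

dataset = tf.data.experimental.CsvDataset(
    "scores.csv",                              # hypothetical filename
    record_defaults=[
        tf.int32,                              # required column: DType only
        tf.constant([0.0], dtype=tf.float32),  # optional column: scalar default
    ],
    header=True,
    select_cols=[0, 2],  # defaults above follow increasing column order
)
for element in dataset:
  print(element)  # one tuple of scalar tensors per CSV row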
Raises
InvalidArgumentError
If exclude_cols is not None and
len(exclude_cols) + len(record_defaults) does not match the total
number of columns in the file(s).
Attributes
element_spec
The type specification of an element of this dataset.
Applies a transformation function to this dataset.
apply enables chaining of custom Dataset transformations, which are
represented as functions that take one Dataset argument and return a
transformed Dataset.
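A short illustrative sketch of apply; dataset_fn is a made-up helper that
keeps only elements smaller than five:
dataset = tf.data.Dataset.range(100)
def dataset_fn(ds):
  return ds.filter(lambda x: x < 5)
dataset = dataset.apply(dataset_fn)
list(dataset.as_numpy_iterator())
[0, 1, 2, 3, 4]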
Returns an iterator which converts all elements of the dataset to numpy.
Use as_numpy_iterator to inspect the content of your dataset. To see
element shapes and types, print dataset elements directly instead of using
as_numpy_iterator.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
This method requires that you are running in eager mode and the dataset's
element_spec contains only TensorSpec components.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
  print(element)
1
2
3
The components of the resulting element will have an additional outer
dimension, which will be batch_size (or N % batch_size for the last
element if batch_size does not divide the number of input elements N
evenly and drop_remainder is False). If your program depends on the
batches having the same outer dimension, you should set the drop_remainder
argument to True to prevent the smaller batch from being produced.
Args
batch_size
A tf.int64 scalar tf.Tensor, representing the number of
consecutive elements of this dataset to combine in a single batch.
drop_remainder
(Optional.) A tf.bool scalar tf.Tensor, representing
whether the last batch should be dropped in the case it has fewer than
batch_size elements; the default behavior is not to drop the smaller
batch.
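A short sketch of how batch_size and drop_remainder affect the final
batch (outputs assume eager execution):
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5])]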
The first time the dataset is iterated over, its elements will be cached
either in the specified file or in memory. Subsequent iterations will
use the cached data.
dataset = tf.data.Dataset.range(5)
dataset = dataset.map(lambda x: x**2)
dataset = dataset.cache()
# The first time reading through the data will generate the data using
# `range` and `map`.
list(dataset.as_numpy_iterator())
[0, 1, 4, 9, 16]
# Subsequent iterations read from the cache.
list(dataset.as_numpy_iterator())
[0, 1, 4, 9, 16]
When caching to a file, the cached data will persist across runs. Even the
first iteration through the data will read from the cache file. Changing
the input pipeline before the call to .cache() will have no effect until
the cache file is removed or the filename is changed.
A tf.string scalar tf.Tensor, representing the name of a
directory on the filesystem to use for caching elements in this Dataset.
If a filename is not provided, the dataset will be cached in memory.
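A sketch of file-based caching; "/path/to/file" is a placeholder path.
Because the cache written by the first pipeline persists, the second
pipeline reads the cached five elements instead of regenerating ten:
dataset = tf.data.Dataset.range(5)
dataset = dataset.cache("/path/to/file")
list(dataset.as_numpy_iterator())
[0, 1, 2, 3, 4]
dataset = tf.data.Dataset.range(10)
dataset = dataset.cache("/path/to/file")  # Same file as above.
list(dataset.as_numpy_iterator())
[0, 1, 2, 3, 4]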
Returns the cardinality of the dataset, if known. cardinality may return
tf.data.INFINITE_CARDINALITY if the dataset contains an infinite number
of elements, or tf.data.UNKNOWN_CARDINALITY if the analysis fails to
determine the number of elements in the dataset (e.g. when the dataset
source is a file).
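A short sketch of the three cases (known, infinite, and unknown
cardinality):
dataset = tf.data.Dataset.range(42)
print(dataset.cardinality().numpy())
42
dataset = dataset.repeat()
cardinality = dataset.cardinality()
print((cardinality == tf.data.INFINITE_CARDINALITY).numpy())
True
dataset = dataset.filter(lambda x: True)
cardinality = dataset.cardinality()
print((cardinality == tf.data.UNKNOWN_CARDINALITY).numpy())
True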
Creates a Dataset by concatenating the given dataset with this dataset.
a = tf.data.Dataset.range(1, 4)  # ==> [ 1, 2, 3 ]
b = tf.data.Dataset.range(4, 8)  # ==> [ 4, 5, 6, 7 ]
ds = a.concatenate(b)
list(ds.as_numpy_iterator())
[1, 2, 3, 4, 5, 6, 7]
# The input dataset and dataset to be concatenated should have the same
# nested structures and output types.
c = tf.data.Dataset.zip((a, b))
a.concatenate(c)
Traceback (most recent call last):
TypeError: Two datasets to concatenate have different types <dtype: 'int64'> and (tf.int64, tf.int64)
d = tf.data.Dataset.from_tensor_slices(["a", "b", "c"])
a.concatenate(d)
Traceback (most recent call last):
TypeError: Two datasets to concatenate have different types <dtype: 'int64'> and <dtype: 'string'>
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.enumerate(start=5)
for element in dataset.as_numpy_iterator():
  print(element)
(5, 1)
(6, 2)
(7, 3)
# The nested structure of the input dataset determines the structure of
# elements in the resulting dataset.
dataset = tf.data.Dataset.from_tensor_slices([(7, 8), (9, 10)])
dataset = dataset.enumerate()
for element in dataset.as_numpy_iterator():
  print(element)
(0, array([7, 8], dtype=int32))
(1, array([ 9, 10], dtype=int32))
Args
start
A tf.int64 scalar tf.Tensor, representing the start value for
enumeration.