A Dataset comprising lines from one or more CSV files.
tf.compat.v1.data.experimental.CsvDataset(
    filenames, record_defaults, compression_type=None, buffer_size=None,
    header=False, field_delim=',', use_quote_delim=True, na_value='',
    select_cols=None
)
Args | |
---|---|
filenames | A tf.string tensor containing one or more filenames. |
record_defaults | A list of default values for the CSV fields. Each item in the list is either a valid CSV DType (float32, float64, int32, int64, string) or a Tensor object with one of those types. Provide one item per column of CSV data: a scalar Tensor default value if the column is optional, or a DType or empty Tensor if it is required. If both this and select_cols are specified, they must have the same length, and record_defaults is assumed to be sorted in order of increasing column index. |
compression_type | (Optional.) A tf.string scalar evaluating to one of "" (no compression), "ZLIB", or "GZIP". Defaults to no compression. |
buffer_size | (Optional.) A tf.int64 scalar denoting the number of bytes to buffer while reading files. Defaults to 4MB. |
header | (Optional.) A tf.bool scalar indicating whether the CSV file(s) have header line(s) that should be skipped when parsing. Defaults to False. |
field_delim | (Optional.) A tf.string scalar containing the delimiter character that separates fields in a record. Defaults to ",". |
use_quote_delim | (Optional.) A tf.bool scalar. If False, treats double quotation marks as regular characters inside of string fields (ignoring RFC 4180, Section 2, Bullet 5). Defaults to True. |
na_value | (Optional.) A tf.string scalar indicating a value that will be treated as NA/NaN. |
select_cols | (Optional.) A sorted list of column indices to select from the input data. If specified, only this subset of columns will be parsed. Defaults to parsing all columns. |
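The arguments above can be combined as in the following minimal sketch. It assumes TensorFlow 2.x with eager execution; the file name, column names, and values are hypothetical and written to a temporary directory purely for illustration. The optional column uses a one-element tensor as its default, following the style of TensorFlow's own CsvDataset examples.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical sample file: one header line and two records of three columns.
# The second record leaves the "score" field empty.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample.csv")
with open(csv_path, "w") as f:
    f.write("id,score,label\n1,1.5,x\n2,,y\n")

# Parse all three columns. A DType marks a required column; a Tensor
# supplies the default used when an optional field is empty.
all_cols = tf.compat.v1.data.experimental.CsvDataset(
    csv_path,
    record_defaults=[
        tf.int32,                              # required
        tf.constant([0.0], dtype=tf.float32),  # optional, defaults to 0.0
        tf.string,                             # required
    ],
    header=True,  # skip the "id,score,label" line
)
all_rows = list(all_cols.as_numpy_iterator())

# Parse only columns 0 and 2. record_defaults then has one entry per
# selected column, listed in increasing column order.
some_cols = tf.compat.v1.data.experimental.CsvDataset(
    csv_path,
    record_defaults=[tf.int32, tf.string],
    header=True,
    select_cols=[0, 2],
)
some_rows = list(some_cols.as_numpy_iterator())

print(all_rows)
print(some_rows)
```

Each element is a tuple with one component per parsed column, and string columns come back as bytes.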
Attributes | |
---|---|
element_spec | The type specification of an element of this dataset. |
output_classes | Returns the class of each component of an element of this dataset. (deprecated) |
output_shapes | Returns the shape of each component of an element of this dataset. (deprecated) |
output_types | Returns the type of each component of an element of this dataset. (deprecated) |
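As a small illustration of element_spec, the sketch below inspects a dictionary-structured dataset before and after batching. It uses a plain tf.data.Dataset rather than a CsvDataset so that it needs no input file; the component names "a" and "b" are arbitrary.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(
    {"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})

# element_spec mirrors the structure of one element: here a dict with
# one TensorSpec per component, carrying both shape and dtype.
spec = dataset.element_spec
print(spec["a"])
print(spec["b"])

# Batching adds an outer dimension whose static size is unknown (None)
# unless drop_remainder=True is used.
batched_spec = dataset.batch(2).element_spec
print(batched_spec["a"].shape)
```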
Methods
apply
apply(
transformation_func
)
Applies a transformation function to this dataset.
apply enables chaining of custom Dataset transformations, which are represented as functions that take one Dataset argument and return a transformed Dataset.
dataset = tf.data.Dataset.range(100)
def dataset_fn(ds):
    return ds.filter(lambda x: x < 5)
dataset = dataset.apply(dataset_fn)
list(dataset.as_numpy_iterator())
[0, 1, 2, 3, 4]
Args | |
---|---|
transformation_func | A function that takes one Dataset argument and returns a Dataset. |
Returns | |
---|---|
Dataset | The Dataset returned by applying transformation_func to this dataset. |
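Because transformation_func takes exactly one Dataset argument, a parameterized transformation is usually written as a factory that returns such a function. The sketch below shows one way to do this; keep_multiples_of and its parameters are hypothetical names chosen for illustration, not part of the API.

```python
import tensorflow as tf


def keep_multiples_of(k, limit):
    """Hypothetical factory returning a Dataset -> Dataset function."""
    def transformation(ds):
        # Keep elements divisible by k, then stop after `limit` of them.
        return ds.filter(lambda x: x % k == 0).take(limit)
    return transformation


dataset = tf.data.Dataset.range(20)
dataset = dataset.apply(keep_multiples_of(3, limit=4))
result = [int(x) for x in dataset.as_numpy_iterator()]
print(result)  # [0, 3, 6, 9]
```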
as_numpy_iterator
as_numpy_iterator()
Returns an iterator which converts all elements of the dataset to numpy.
Use as_numpy_iterator to inspect the content of your dataset. To see element shapes and types, print dataset elements directly instead of using as_numpy_iterator.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
    print(element)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
This method requires that you are running in eager mode and that the dataset's element_spec contains only TensorSpec components.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
    print(element)
1
2
3
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
print(list(dataset.as_numpy_iterator()))
[1, 2, 3]
as_numpy_iterator() will preserve the nested structure of dataset elements.
dataset = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]),
                                              'b': [5, 6]})
list(dataset.as_numpy_iterator()) == [{'a': (1, 3), 'b': 5},
                                      {'a': (2, 4), 'b': 6}]
True
Returns | |
---|---|
An iterable over the elements of the dataset, with their tensors converted to numpy arrays. |
Raises | |
---|---|
TypeError | if an element contains a non-Tensor value. |
RuntimeError | if eager execution is not enabled. |
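One case that triggers the TypeError is a dataset whose elements are themselves datasets, such as the output of window. The sketch below, which assumes TensorFlow 2.x in eager mode, shows the failure and one possible workaround: flattening each window into a tensor before converting to numpy.

```python
import tensorflow as tf

# window() yields elements that are nested datasets, not tensors.
nested = tf.data.Dataset.range(4).window(2)

try:
    list(nested.as_numpy_iterator())
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Batching each window turns it back into a tensor, so the
# flattened dataset can be converted to numpy.
flat = nested.flat_map(lambda window: window.batch(2))
windows = [w.tolist() for w in flat.as_numpy_iterator()]
print(windows)  # [[0, 1], [2, 3]]
```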
batch
batch(
batch_size, drop_remainder=False
)
Combines consecutive elements of this dataset into batches.
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5])]
The components of the resulting element will have an additional outer dimension, which will be batch_size (or N % batch_size for the last element if batch_size does not divide the number of input elements N evenly and drop_remainder is False). If your program depends on the batches having the same outer dimension, you should set the drop_remainder argument to True to prevent the smaller batch from being produced.
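The effect on shapes can be checked directly. In this sketch (TensorFlow 2.x, eager mode assumed), N = 8 and batch_size = 3, so the last batch holds 8 % 3 = 2 elements; the static shape in element_spec is only fully known when drop_remainder is True.

```python
import tensorflow as tf

N, batch_size = 8, 3

# Runtime batch sizes: the last batch holds N % batch_size elements.
sizes = [int(batch.shape[0])
         for batch in tf.data.Dataset.range(N).batch(batch_size)]
print(sizes)  # [3, 3, 2]

# Static shapes: the outer dimension is unknown by default...
default_shape = tf.data.Dataset.range(N).batch(batch_size).element_spec.shape
# ...and fixed to batch_size when the remainder is dropped.
dropped_shape = tf.data.Dataset.range(N).batch(
    batch_size, drop_remainder=True).element_spec.shape
print(default_shape.as_list(), dropped_shape.as_list())  # [None] [3]
```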
Args | |
---|---|
batch_size | A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch. |
drop_remainder | (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch. |
Returns | |
---|---|
Dataset | A Dataset. |
cache
cache(
filename=''
)
Caches the elements in this dataset.
The first time the dataset is iterated over, its elements will be cached either in the specified file or in memory. Subsequent iterations will use the cached data.
dataset = tf.data.Dataset.range(5)
dataset = dataset.map(lambda x: x**2)
dataset = dataset.cache()
# The first time reading through the data will generate the data using
# `range` and `map`.
list(dataset.as_numpy_iterator())
[0, 1, 4, 9, 16]
# Subsequent iterations read from the cache.
list(dataset.as_numpy_iterator())
[0, 1, 4, 9, 16]
When caching to a file, the cached data will persist across runs. Even the first iteration through the data will read from the cache file. Changing the input pipeline before the call to .cache() will have no effect until the cache file is removed or the filename is changed.
dataset = tf.data.Dataset.range(5)
dataset = dataset.cache("/path/to/file") # doctest: +SKIP
list(dataset.as_numpy_iterator()) # doctest: +SKIP
[0, 1, 2, 3, 4]
dataset = tf.data.Dataset.range(10)
dataset = dataset.cache("/path/to/file") # Same file! # doctest: +SKIP
list(dataset.as_numpy_iterator()) # doctest: +SKIP
[0, 1, 2, 3, 4]
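The file-cache behavior above can be reproduced with a temporary directory in place of "/path/to/file". This is a sketch assuming TensorFlow 2.x in eager mode; the cache file names are arbitrary. It also shows that changing the filename is enough to pick up a modified pipeline.

```python
import os
import tempfile

import tensorflow as tf

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "cache")

# A full first pass writes the cache file.
first = tf.data.Dataset.range(5).cache(cache_path)
first_pass = [int(x) for x in first.as_numpy_iterator()]

# A different pipeline pointed at the same file reads the old contents.
second = tf.data.Dataset.range(10).cache(cache_path)
stale = [int(x) for x in second.as_numpy_iterator()]
print(stale)  # [0, 1, 2, 3, 4]

# Using a new filename makes the changed pipeline take effect.
fresh_path = os.path.join(cache_dir, "cache_v2")
third = tf.data.Dataset.range(10).cache(fresh_path)
fresh = [int(x) for x in third.as_numpy_iterator()]
print(len(fresh))  # 10
```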