A `Dataset` comprising lines from one or more CSV files.

Inherits From: `Dataset`
```python
tf.data.experimental.CsvDataset(
    filenames, record_defaults, compression_type=None, buffer_size=None,
    header=False, field_delim=',', use_quote_delim=True,
    na_value='', select_cols=None, exclude_cols=None
)
```
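In the common case you pass one filename and one `record_defaults` entry per column, then iterate over the resulting tuples of scalar tensors. The sketch below writes a small file first; the name `example.csv` and its contents are assumptions for illustration only.

```python
import tensorflow as tf

# Hypothetical two-column CSV file, created here just for the example.
with open('example.csv', 'w') as f:
  f.write('1,cat\n2,dog\n')

# One record_defaults entry per column: an int32 column and a string column.
dataset = tf.data.experimental.CsvDataset(
    'example.csv', record_defaults=[tf.int32, tf.string])
for count, label in dataset:
  print(count.numpy(), label.numpy())
# Expected output:
# 1 b'cat'
# 2 b'dog'
```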
Args | |
---|---|
`filenames` | A `tf.string` tensor containing one or more filenames.
`record_defaults` | A list of default values for the CSV fields. Each item in the list is either a valid CSV `DType` (`float32`, `float64`, `int32`, `int64`, `string`) or a `Tensor` object with one of the above types, one per column of CSV data: a scalar `Tensor` default value for the column if it is optional, or a `DType` or empty `Tensor` if it is required. If both this and `select_cols` are specified, they must have the same length, and `record_defaults` is assumed to be sorted in order of increasing column index. If both this and `exclude_cols` are specified, the lengths of `record_defaults` and `exclude_cols` should sum to the total number of columns in the CSV file. A minimal usage sketch appears after this table.
`compression_type` | (Optional.) A `tf.string` scalar evaluating to one of `""` (no compression), `"ZLIB"`, or `"GZIP"`. Defaults to no compression.
`buffer_size` | (Optional.) A `tf.int64` scalar denoting the number of bytes to buffer while reading files. Defaults to 4 MB.
`header` | (Optional.) A `tf.bool` scalar indicating whether the CSV file(s) have header line(s) that should be skipped when parsing. Defaults to `False`.
`field_delim` | (Optional.) A `tf.string` scalar containing the delimiter character that separates fields in a record. Defaults to `","`.
`use_quote_delim` | (Optional.) A `tf.bool` scalar. If `False`, treats double quotation marks as regular characters inside string fields (ignoring RFC 4180, Section 2, Bullet 5). Defaults to `True`.
`na_value` | (Optional.) A `tf.string` scalar indicating a value that will be treated as NA/NaN.
`select_cols` | (Optional.) A sorted list of column indices to select from the input data. If specified, only this subset of columns is parsed. Defaults to parsing all columns. At most one of `select_cols` and `exclude_cols` can be specified.
`exclude_cols` | (Optional.) A sorted list of column indices to exclude from the input data. If specified, only the complement of this set of columns is parsed. Defaults to parsing all columns. At most one of `select_cols` and `exclude_cols` can be specified.
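The interaction between `record_defaults` and `select_cols` is easier to see in code. Below is a minimal sketch; the file name `wide.csv` and its contents are illustrative assumptions, not part of the API.

```python
import tensorflow as tf

# Hypothetical three-column file; the middle field of the second row is empty.
with open('wide.csv', 'w') as f:
  f.write('1,a,10.0\n2,,20.0\n')

# Parse only columns 0 and 1. Column 0 uses a DType, so it is required;
# column 1 uses a scalar Tensor default, so an empty field falls back to it.
dataset = tf.data.experimental.CsvDataset(
    'wide.csv',
    record_defaults=[tf.int32, tf.constant('missing', dtype=tf.string)],
    select_cols=[0, 1])
print(list(dataset.as_numpy_iterator()))
# Expected: [(1, b'a'), (2, b'missing')]
```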
Raises | |
---|---|
`InvalidArgumentError` | If `exclude_cols` is not `None` and `len(exclude_cols) + len(record_defaults)` does not match the total number of columns in the file(s).
Attributes | |
---|---|
`element_spec` | The type specification of an element of this dataset; see the sketch below.
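As an illustration of `element_spec` (reusing the hypothetical `example.csv` from the sketch above), each element of a `CsvDataset` is a tuple of scalar tensors, one per selected column:

```python
dataset = tf.data.experimental.CsvDataset(
    'example.csv', record_defaults=[tf.int32, tf.string])
print(dataset.element_spec)
# Expected, roughly:
# (TensorSpec(shape=(), dtype=tf.int32, name=None),
#  TensorSpec(shape=(), dtype=tf.string, name=None))
```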
Methods
apply
```python
apply(
    transformation_func
)
```
Applies a transformation function to this dataset.
`apply` enables chaining of custom `Dataset` transformations, which are represented as functions that take one `Dataset` argument and return a transformed `Dataset`.
```python
dataset = tf.data.Dataset.range(100)
def dataset_fn(ds):
  return ds.filter(lambda x: x < 5)
dataset = dataset.apply(dataset_fn)
list(dataset.as_numpy_iterator())
# [0, 1, 2, 3, 4]
```
Args | |
---|---|
`transformation_func` | A function that takes one `Dataset` argument and returns a `Dataset`.
Returns | |
---|---|
`Dataset` | The `Dataset` returned by applying `transformation_func` to this dataset.
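As a usage note, `apply` is also the hook for the prepackaged transformations in `tf.data.experimental`. A small sketch, assuming `tf.data.experimental.ignore_errors` is available in your TensorFlow version:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 0.0])
# 1.0 / 0.0 produces Inf, which check_numerics turns into an error.
dataset = dataset.map(lambda x: tf.debugging.check_numerics(1.0 / x, 'error'))
# ignore_errors() silently drops the element whose computation failed.
dataset = dataset.apply(tf.data.experimental.ignore_errors())
print(list(dataset.as_numpy_iterator()))
# Expected: [1.0, 0.5]
```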
as_numpy_iterator
```python
as_numpy_iterator()
```
Returns an iterator which converts all elements of the dataset to numpy.
Use `as_numpy_iterator` to inspect the content of your dataset. To see element shapes and types, print dataset elements directly instead of using `as_numpy_iterator`.
```python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)
```
This method requires that you are running in eager mode and the dataset's `element_spec` contains only `TensorSpec` components.
```python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
  print(element)
# 1
# 2
# 3
```

```python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
print(list(dataset.as_numpy_iterator()))
# [1, 2, 3]
```
`as_numpy_iterator()` will preserve the nested structure of dataset elements.
```python
dataset = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]),
                                              'b': [5, 6]})
list(dataset.as_numpy_iterator()) == [{'a': (1, 3), 'b': 5},
                                      {'a': (2, 4), 'b': 6}]
# True
```
Returns | |
---|---|
An iterable over the elements of the dataset, with their tensors converted to numpy arrays. |
Raises | |
---|---|
`TypeError` | If an element contains a non-`Tensor` value.
`RuntimeError` | If eager execution is not enabled.
batch
```python
batch(
    batch_size, drop_remainder=False
)
```
Combines consecutive elements of this dataset into batches.
```python
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
```

```python
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
# [array([0, 1, 2]), array([3, 4, 5])]
```
The components of the resulting element will have an additional outer dimension, which will be `batch_size` (or `N % batch_size` for the last element if `batch_size` does not divide the number of input elements `N` evenly and `drop_remainder` is `False`). If your program depends on the batches having the same outer dimension, you should set the `drop_remainder` argument to `True` to prevent the smaller batch from being produced.
Args | |
---|---|
`batch_size` | A `tf.int64` scalar, representing the number of consecutive elements of this dataset to combine in a single batch.
`drop_remainder` | (Optional.) A `tf.bool` scalar, representing whether the last batch should be dropped if it has fewer than `batch_size` elements. Defaults to `False`.