

A Dataset comprising lines from one or more CSV files.

Inherits From: Dataset


The tf.data.experimental.CsvDataset class provides a minimal CSV Dataset interface. There is also a richer tf.data.experimental.make_csv_dataset function which provides additional convenience features such as column header parsing, column type-inference, automatic shuffling, and file interleaving.

The elements of this dataset correspond to records from the file(s). CSV files are expected to follow RFC 4180 (https://tools.ietf.org/html/rfc4180). Note that leading and trailing spaces are allowed in int and float fields.

For example, suppose we have a file 'my_file0.csv' with four CSV columns of different data types:

with open('/tmp/my_file0.csv', 'w') as f:
  f.write('abcdefg,4.28E10,5.55E6,12\n')
  f.write('hijklmn,-5.3E14,,2\n')

We can construct a CsvDataset from it as follows:

dataset = tf.data.experimental.CsvDataset(
  '/tmp/my_file0.csv',
  [tf.float32,  # Required field, use dtype or empty tensor
   tf.constant([0.0], dtype=tf.float32),  # Optional field, default to 0.0
   tf.int32,  # Required field, use dtype or empty tensor
  ],
  select_cols=[1,2,3]  # Only parse last three columns
)

The expected output of its iterations is:

for element in dataset.as_numpy_iterator():
  print(element)
(4.28e10, 5.55e6, 12)
(-5.3e14, 0.0, 2)

See https://www.tensorflow.org/tutorials/load_data/csv#tfdataexperimentalcsvdataset for more in-depth example usage.

filenames A tf.string tensor containing one or more filenames.
record_defaults A list of default values for the CSV fields. Each item in the list is either a valid CSV DType (float32, float64, int32, int64, string), or a Tensor object with one of the above types. There is one item per column of CSV data: a scalar Tensor default value if the column is optional, or a DType or empty Tensor if it is required. If both this and select_cols are specified, they must have the same length, and record_defaults is assumed to be sorted in order of increasing column index. If both this and exclude_cols are specified, the sum of the lengths of record_defaults and exclude_cols should equal the total number of columns in the CSV file.
compression_type (Optional.) A tf.string scalar evaluating to one of "" (no compression), "ZLIB", or "GZIP". Defaults to no compression.
buffer_size (Optional.) A tf.int64 scalar denoting the number of bytes to buffer while reading files. Defaults to 4MB.
header (Optional.) A tf.bool scalar indicating whether the CSV file(s) have header line(s) that should be skipped when parsing. Defaults to False.
field_delim (Optional.) A tf.string scalar containing the delimiter character that separates fields in a record. Defaults to ",".
use_quote_delim (Optional.) A tf.bool scalar. If False, treats double quotation marks as regular characters inside of string fields (ignoring RFC 4180, Section 2, Bullet 5). Defaults to True.
na_value (Optional.) A tf.string scalar indicating a value that will be treated as NA/NaN.
select_cols (Optional.) A sorted list of column indices to select from the input data. If specified, only this subset of columns will be parsed. Defaults to parsing all columns. At most one of select_cols and exclude_cols can be specified.
exclude_cols (Optional.) A sorted list of column indices to exclude from the input data. If specified, only the complement of this set of columns will be parsed. Defaults to parsing all columns. At most one of select_cols and exclude_cols can be specified.

InvalidArgumentError If exclude_cols is not None and len(exclude_cols) + len(record_defaults) does not match the total number of columns in the file(s)
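
As an illustration of how header and exclude_cols interact with record_defaults, here is a minimal sketch. The file name, column names, and values are made up for the example. The file has four columns and one is excluded, so record_defaults needs three entries.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical four-column file with a header row (names and values are made up).
csv_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(csv_path, "w") as f:
    f.write("name,weight,height,count\n")
    f.write("abc,1.5,2.5,3\n")
    f.write("def,4.5,,6\n")  # third field left empty on purpose

dataset = tf.data.experimental.CsvDataset(
    csv_path,
    [
        tf.float32,                            # column 1: required
        tf.constant([0.0], dtype=tf.float32),  # column 2: optional, defaults to 0.0
        tf.int32,                              # column 3: required
    ],
    header=True,       # skip the first line of the file
    exclude_cols=[0],  # 1 excluded + 3 defaults = 4 columns in the file
)

# Convert numpy scalars to plain Python floats for easy inspection.
rows = [tuple(float(v) for v in row) for row in dataset.as_numpy_iterator()]
print(rows)
```

The empty field in the second record should fall back to the 0.0 default because that column is declared optional.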

element_spec The type specification of an element of this dataset.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset.element_spec
TensorSpec(shape=(), dtype=tf.int32, name=None)

For more information, read the tf.data guide.
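
For datasets with structured elements, element_spec mirrors that structure. The following sketch uses a made-up dictionary dataset (the keys "id" and "point" are only illustrative) to show one TensorSpec per component.

```python
import tensorflow as tf

# Illustrative dataset: each element is a dict with a scalar and a length-2 vector.
dataset = tf.data.Dataset.from_tensor_slices(
    {"id": [1, 2], "point": [[0.5, 1.5], [2.5, 3.5]]}
)

spec = dataset.element_spec  # a dict with the same keys as the elements
print(spec["id"])
print(spec["point"])
```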




Applies a transformation function to this dataset.

apply enables chaining of custom Dataset transformations, which are represented as functions that take one Dataset argument and return a transformed Dataset.

dataset = tf.data.Dataset.range(100)
def dataset_fn(ds):
  return ds.filter(lambda x: x < 5)
dataset = dataset.apply(dataset_fn)
list(dataset.as_numpy_iterator())
[0, 1, 2, 3, 4]

transformation_func A function that takes one Dataset argument and returns a Dataset.

Dataset The Dataset returned by applying transformation_func to this dataset.
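
Because each transformation function takes and returns a Dataset, calls to apply can be chained. The sketch below uses two hypothetical helper functions, keep_even and square, and checks that chaining apply gives the same elements as calling the functions directly.

```python
import tensorflow as tf

def keep_even(ds):
    # Hypothetical helper: keep only even elements.
    return ds.filter(lambda x: x % 2 == 0)

def square(ds):
    # Hypothetical helper: square every element.
    return ds.map(lambda x: x * x)

chained = tf.data.Dataset.range(6).apply(keep_even).apply(square)
direct = square(keep_even(tf.data.Dataset.range(6)))

chained_values = [int(x) for x in chained.as_numpy_iterator()]
direct_values = [int(x) for x in direct.as_numpy_iterator()]
print(chained_values)
```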



Returns an iterator which converts all elements of the dataset to numpy.

Use as_numpy_iterator to inspect the content of your dataset. To see element shapes and types, print dataset elements directly instead of using as_numpy_iterator.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)

This method requires that you are running in eager mode and the dataset's element_spec contains only TensorSpec components.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
  print(element)
1
2
3

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
print(list(dataset.as_numpy_iterator()))
[1, 2, 3]

as_numpy_iterator() will preserve the nested structure of dataset elements.

dataset = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]),
                                              'b': [5, 6]})
list(dataset.as_numpy_iterator()) == [{'a': (1, 3), 'b': 5},
                                      {'a': (2, 4), 'b': 6}]
True

An iterable over the elements of the dataset, with their tensors converted to numpy arrays.

TypeError if an element contains a non-Tensor value.
RuntimeError if eager execution is not enabled.
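
To make the conversion concrete, the following sketch (assuming eager execution, the TensorFlow 2 default) checks that non-scalar elements come back as numpy arrays rather than tensors. The variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

# Each element is a length-2 integer vector.
dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])

elements = list(dataset.as_numpy_iterator())
all_numpy = all(isinstance(e, np.ndarray) for e in elements)
as_lists = [e.tolist() for e in elements]
print(all_numpy, as_lists)
```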



Combines consecutive elements of this dataset into batches.

dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5])]
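
The effect of drop_remainder is also visible in the static shape reported by element_spec. In this minimal sketch (variable names are illustrative), the batch dimension is unknown when the last batch may be smaller, and fixed when the remainder is dropped.

```python
import tensorflow as tf

with_remainder = tf.data.Dataset.range(8).batch(3)
without_remainder = tf.data.Dataset.range(8).batch(3, drop_remainder=True)

# The last batch may hold fewer than 3 elements, so the size is not known statically.
unknown_dim = with_remainder.element_spec.shape.as_list()
# Every batch holds exactly 3 elements, so the size is known statically.
known_dim = without_remainder.element_spec.shape.as_list()
print(unknown_dim, known_dim)
```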

The components of the resulting element will have an additional outer dimension, which will be