ML Community Day is November 9! Join us for updates from TensorFlow, JAX, and more Learn more

tf.data.experimental.make_csv_dataset

Reads CSV files into a dataset.

Used in the notebooks

Used in the guide Used in the tutorials

Reads CSV files into a dataset, where each element of the dataset is a (features, labels) tuple that corresponds to a batch of CSV rows. The features dictionary maps feature column names to Tensors containing the corresponding feature data, and labels is a Tensor containing the batch's label data.

By default, the first rows of the CSV files are expected to be headers listing the column names. If the first rows are not headers, set header=False and provide the column names with the column_names argument.

By default, the dataset is repeated indefinitely, reshuffling the order each time. This behavior can be modified by setting the num_epochs and shuffle arguments.

For example, suppose you have a CSV file containing

Feature_A Feature_B
1 "a"
2 "b"
3 "c"
4 "d"
# No label column specified
dataset = tf.data.experimental.make_csv_dataset(filename, batch_size=2)
iterator = ds.as_numpy_iterator()
print(dict(next(iterator)))
# prints a dictionary of batched features:
# OrderedDict([('Feature_A', array([1, 4], dtype=int32)),
#              ('Feature_B', array([b'a', b'd'], dtype=object))])
# Set Feature_B as label column
dataset = tf.data.experimental.make_csv_dataset(
    filename, batch_size=2, label_name="Feature_B")
iterator = ds.as_numpy_iterator()
print(next(iterator))
# prints (features, labels) tuple:
# (OrderedDict([('Feature_A', array([1, 2], dtype=int32))]),
#  array([b'a', b'b'], dtype=object))

See the Load CSV data guide for more examples of using make_csv_dataset to read CSV data.

file_pattern List of files or patterns of file paths containing CSV records. See tf.io.gfile.glob for pattern rules.
batch_size An int representing the number of records to combine in a single batch.
column_names An optional list of strings that corresponds to the CSV columns, in order. One per column of the input record. If this is not provided, infers the column names from the first row of the records. These names will be the keys of the features dict of each dataset element.
column_defaults A optional list of default values for the CSV fields. One item per selected column of the input record. Each item in the list is either a valid CSV dtype (float32, float64, int32, int64, or string), or a Tensor with one of the aforementioned types. The tensor can either be a scalar default value (if the column is optional), or an empty tensor (if the column is required). If a dtype is provided instead of a tensor, the column is also treated as required. If this list is not provided, tries to infer types based on reading the first num_rows_for_inference rows of files specified, and assumes all columns are optional, defaulting to 0 for numeric values and "" for string values. If both this and select_columns are specified, these must have the same lengths, and column_defaults is assumed to be sorted in order of increasing column index.
label_name A optional string corresponding to the label column. If provided, the data for this column is returned as a separate Tensor from the features dictionary, so that the dataset complies with the format expected by a tf.Estimator.train or tf.Estimator.evaluate input function.
select_columns An optional list of integer indices or string column names, that specifies a subset of columns of CSV data to select. If column names are provided, these must correspond to names provided in column_names or inferred from the file header lines. When this argument is specified, only a subset of CSV columns will be parsed and returned, corresponding to the columns specified. Using this results in faster parsing and lower memory usage. If both this and column_defaults are specified, these must have the same lengths, and