Load CSV with tf.data

This tutorial provides an example of how to load CSV data from a file into a tf.data.Dataset.

The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.

Setup

!pip install -q tensorflow==2.0.0-beta1
from __future__ import absolute_import, division, print_function, unicode_literals
import functools

import numpy as np
import tensorflow as tf
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 0us/step
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

Load data

To start, let's look at the top of the CSV file to see how it is formatted.

!head {train_file_path}
survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

As you can see, the columns in the CSV are named. The dataset constructor will pick these names up automatically. If the file you are working with does not contain the column names in the first line, pass them in a list of strings to the column_names argument in the make_csv_dataset function.


CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

dataset = tf.data.experimental.make_csv_dataset(
    ...,
    column_names=CSV_COLUMNS,
    ...)

This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) select_columns argument of the constructor.


dataset = tf.data.experimental.make_csv_dataset(
    ...,
    select_columns=columns_to_use,
    ...)
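
For example, to build a dataset from just a few of the columns above (a hypothetical subset; any names from the CSV header work, and the label column must be among them):

columns_to_use = ['survived', 'age', 'fare', 'class']

subset_dataset = tf.data.experimental.make_csv_dataset(
    train_file_path,
    batch_size=12,
    select_columns=columns_to_use,  # only these columns are parsed
    label_name='survived',          # must be one of select_columns
    num_epochs=1)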

The only column you need to identify explicitly is the one with the value that the model is intended to predict.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

Now read the CSV data from the file and create a dataset.

(For the full documentation, see tf.data.experimental.make_csv_dataset)

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)
WARNING: Logging before flag parsing goes to stderr.
W0713 00:51:58.834274 140099288360704 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/data/experimental/ops/readers.py:498: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.

Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (12 in this case).

It might help to see this yourself.

examples, labels = next(iter(raw_train_data)) # Just the first batch.
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
EXAMPLES: 
 OrderedDict([('sex', <tf.Tensor: id=170, shape=(12,), dtype=string, numpy=
array([b'male', b'male', b'female', b'female', b'female', b'male',
       b'female', b'male', b'male', b'female', b'male', b'female'],
      dtype=object)>), ('age', <tf.Tensor: id=162, shape=(12,), dtype=float32, numpy=
array([28., 36., 31., 22., 28., 28., 24., 16., 28., 18., 36., 28.],
      dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: id=168, shape=(12,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int32)>), ('parch', <tf.Tensor: id=169, shape=(12,), dtype=int32, numpy=array([0, 1, 2, 1, 0, 0, 0, 3, 0, 0, 2, 0], dtype=int32)>), ('fare', <tf.Tensor: id=167, shape=(12,), dtype=float32, numpy=
array([ 35.5  , 512.329, 164.867,  55.   ,   7.75 ,   7.733,  13.   ,
        34.375,   7.225,   9.842,  27.75 ,  89.104], dtype=float32)>), ('class', <tf.Tensor: id=164, shape=(12,), dtype=string, numpy=
array([b'First', b'First', b'First', b'First', b'Third', b'Third',
       b'Second', b'Third', b'Third', b'Third', b'Second', b'First'],
      dtype=object)>), ('deck', <tf.Tensor: id=165, shape=(12,), dtype=string, numpy=
array([b'C', b'B', b'C', b'E', b'unknown', b'unknown', b'F', b'unknown',
       b'unknown', b'unknown', b'unknown', b'C'], dtype=object)>), ('embark_town', <tf.Tensor: id=166, shape=(12,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
       b'Queenstown', b'Queenstown', b'Southampton', b'Southampton',
       b'Cherbourg', b'Southampton', b'Southampton', b'Cherbourg'],
      dtype=object)>), ('alone', <tf.Tensor: id=163, shape=(12,), dtype=string, numpy=
array([b'y', b'n', b'n', b'n', b'y', b'y', b'y', b'n', b'y', b'y', b'n',
       b'n'], dtype=object)>)]) 

LABELS: 
 tf.Tensor([1 1 1 1 1 0 1 0 0 1 0 1], shape=(12,), dtype=int32)

Data preprocessing

Categorical data

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

Use the tf.feature_column API to create a collection with a tf.feature_column.indicator_column for each categorical column.

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}
categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))
# See what you just created.
categorical_columns
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

These feature columns will become part of the data preprocessing input layer later, when you build the model.
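
To see what an indicator column produces, you can apply just the categorical columns to the example batch from earlier (a quick sanity check, not part of the model itself):

categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
# The first row of the batch, as concatenated one-hot vectors.
print(categorical_layer(examples).numpy()[0])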

Continuous data

Continuous data needs to be normalized.

Write a function that normalizes the values and reshapes them into two-dimensional tensors.

def process_continuous_data(mean, data):
  # Scale by 1/(2*mean), so values near the column mean map to roughly 0.5.
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])
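
For example, using the age column's mean of about 29.6 (hard-coded in the MEANS dictionary below), ages near the mean map to values near 0.5:

process_continuous_data(29.631308, tf.constant([5.0, 29.6, 80.0]))
# Roughly [[0.084], [0.5], [1.35]] after scaling and reshaping.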

Now create a collection of numeric columns. The tf.feature_column.numeric_column API takes a normalizer_fn argument. Pass in a functools.partial made from the processing function, loaded with the mean of each column.

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

numerical_columns = []

for feature in MEANS.keys():
  num_col = tf.feature_column.numeric_column(
      feature,
      normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
  numerical_columns.append(num_col)
# See what you just created.
numerical_columns
[NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f6b21113950>, 29.631308)),
 NumericColumn(key='parch', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f6b21113950>, 0.379585)),
 NumericColumn(key='fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f6b21113950>, 34.385399)),
 NumericColumn(key='n_siblings_spouses', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f6b21113950>, 0.545455))]

The means-based normalization used here requires knowing the mean of each column ahead of time. To calculate normalized values in a continuous data stream, use TensorFlow Transform.
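
If you would rather compute the means than hard-code them, here is a minimal sketch assuming pandas is available (it is not otherwise used in this tutorial):

import pandas as pd

# Compute per-column means for the numeric features from the training CSV.
df = pd.read_csv(train_file_path)
computed_means = {name: df[name].mean() for name in MEANS}
print(computed_means)  # should closely match the hard-coded MEANS values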

Create a preprocessing layer

Combine the two feature column collections and pass them to tf.keras.layers.DenseFeatures to create an input layer that will handle your preprocessing.

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)
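
You can call this layer on the example batch to inspect the dense vector it produces for each row (the one-hot categorical features concatenated with the normalized numeric features):

print(preprocessing_layer(examples).numpy()[0])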

Build the model

Build a tf.keras.Sequential, starting with the preprocessing_layer.

model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

Train, evaluate, and predict

Now the model can be trained.

train_data = raw_train_data.shuffle(500)
test_data = raw_test_data
model.fit(train_data, epochs=20)
Epoch 1/20

W0713 00:51:59.443597 140099288360704 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:2655: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0713 00:51:59.459218 140099288360704 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4215: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0713 00:51:59.460043 140099288360704 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4270: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

53/53 [==============================] - 2s 34ms/step - loss: 0.5600 - accuracy: 0.6275
Epoch 2/20
53/53 [==============================] - 0s 4ms/step - loss: 0.4412 - accuracy: 0.8214
Epoch 3/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4193 - accuracy: 0.8353
Epoch 4/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4075 - accuracy: 0.8518
Epoch 5/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3981 - accuracy: 0.8493
Epoch 6/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3903 - accuracy: 0.8539
Epoch 7/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3829 - accuracy: 0.8532
Epoch 8/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3767 - accuracy: 0.8508
Epoch 9/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3713 - accuracy: 0.8497
Epoch 10/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3661 - accuracy: 0.8491
Epoch 11/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3615 - accuracy: 0.8521
Epoch 12/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3571 - accuracy: 0.8586
Epoch 13/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3530 - accuracy: 0.8639
Epoch 14/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3488 - accuracy: 0.8656
Epoch 15/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3451 - accuracy: 0.8660
Epoch 16/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3415 - accuracy: 0.8684
Epoch 17/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3382 - accuracy: 0.8732
Epoch 18/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3353 - accuracy: 0.8728
Epoch 19/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3321 - accuracy: 0.8760
Epoch 20/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3289 - accuracy: 0.8801

<tensorflow.python.keras.callbacks.History at 0x7f6b0c6e3358>

Once the model is trained, you can check its accuracy on the test set, test_data.

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))
     22/Unknown - 0s 22ms/step - loss: 0.4427 - accuracy: 0.8182

Test Loss 0.44273926520889456, Test Accuracy 0.8181818127632141

Use tf.keras.Model.predict to infer labels on a batch or a dataset of batches.

predictions = model.predict(test_data)

# Show some results. Note: make_csv_dataset shuffles by default, so the
# labels pulled from test_data here are not guaranteed to line up with the
# prediction order; pass shuffle=False in get_dataset for an exact match.
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))

Predicted survival: 1.30%  | Actual outcome:  DIED
Predicted survival: 86.70%  | Actual outcome:  SURVIVED
Predicted survival: 99.73%  | Actual outcome:  SURVIVED
Predicted survival: 9.72%  | Actual outcome:  DIED
Predicted survival: 7.17%  | Actual outcome:  DIED
Predicted survival: 10.12%  | Actual outcome:  DIED
Predicted survival: 2.77%  | Actual outcome:  DIED
Predicted survival: 49.21%  | Actual outcome:  DIED
Predicted survival: 89.52%  | Actual outcome:  SURVIVED
Predicted survival: 10.11%  | Actual outcome:  DIED
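
Since the model ends in a single sigmoid unit, each prediction is a probability. To turn the probabilities into hard survived/died labels, you can threshold at 0.5 (a common default, not something this tutorial prescribes):

# Convert the predicted probabilities to 0/1 class labels.
predicted_classes = (predictions > 0.5).astype(int).flatten()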