Load CSV with tf.data


This tutorial provides an example of how to load CSV data from a file into a tf.data.Dataset.

The data used in this tutorial are taken from the Titanic passenger list. We'll try to predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.

Setup

!pip install -q tensorflow==2.0.0-alpha0
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 0us/step
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

Load data

So we know what we're working with, let's look at the top of the CSV file.

!head {train_file_path}
survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

As you can see, the columns in the CSV are labeled. We'll need this list of column names later on, so let's read it out of the file.

# CSV columns in the input file.
with open(train_file_path, 'r') as f:
    names_row = f.readline()


CSV_COLUMNS = names_row.rstrip('\n').split(',')
print(CSV_COLUMNS)
['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

The dataset constructor will pick up these column names automatically.

If the file you are working with does not contain the column names in the first line, pass them as a list of strings to the column_names argument of the make_csv_dataset function.


CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

dataset = tf.data.experimental.make_csv_dataset(
     ...,
     column_names=CSV_COLUMNS,
     ...)
  

This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) select_columns argument of the constructor.


drop_columns = ['fare', 'embark_town']
columns_to_use = [col for col in CSV_COLUMNS if col not in drop_columns]

dataset = tf.data.experimental.make_csv_dataset(
  ...,
  select_columns = columns_to_use, 
  ...)
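
For reference, here's what a complete call might look like with this tutorial's file path and label column filled in. This is a sketch for illustration only (it isn't executed below, and the batch_size value is an arbitrary choice):

temp_dataset = tf.data.experimental.make_csv_dataset(
    train_file_path,
    batch_size=12,
    column_names=CSV_COLUMNS,       # optional here, since the file has a header row
    select_columns=columns_to_use,  # must include the label column
    label_name='survived',
    num_epochs=1)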

We also have to identify which column will serve as the labels for each example, and what those labels are.

LABELS = [0, 1]
LABEL_COLUMN = 'survived'

FEATURE_COLUMNS = [column for column in CSV_COLUMNS if column != LABEL_COLUMN]

Now that these constructor argument values are in place, read the CSV data from the file and create a dataset.

(For the full documentation, see tf.data.experimental.make_csv_dataset)

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset
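
Note that num_epochs=1 means each pass over the dataset yields the data exactly once; tf.keras.Model.fit re-iterates the dataset at the start of every epoch, so the multi-epoch training below still works. Also note that make_csv_dataset shuffles its output by default.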

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (12 in this case).

It might help to see this yourself.

examples, labels = next(iter(raw_train_data)) # Just the first batch.
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
EXAMPLES: 
 OrderedDict([('sex', <tf.Tensor: id=170, shape=(12,), dtype=string, numpy=
array([b'male', b'male', b'female', b'female', b'female', b'male',
       b'female', b'male', b'female', b'male', b'male', b'male'],
      dtype=object)>), ('age', <tf.Tensor: id=162, shape=(12,), dtype=float32, numpy=
array([21., 22., 36., 28., 23., 26., 31., 18., 24., 28., 28., 47.],
      dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: id=168, shape=(12,), dtype=int32, numpy=array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0], dtype=int32)>), ('parch', <tf.Tensor: id=169, shape=(12,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: id=167, shape=(12,), dtype=float32, numpy=
array([ 77.287,   7.229,  17.4  ,   8.05 , 113.275,   7.896,  18.   ,
         7.775,  49.504,   7.75 ,  47.1  ,  38.5  ], dtype=float32)>), ('class', <tf.Tensor: id=164, shape=(12,), dtype=string, numpy=
array([b'First', b'Third', b'Third', b'Third', b'First', b'Third',
       b'Third', b'Third', b'First', b'Third', b'First', b'First'],
      dtype=object)>), ('deck', <tf.Tensor: id=165, shape=(12,), dtype=string, numpy=
array([b'D', b'unknown', b'unknown', b'unknown', b'D', b'unknown',
       b'unknown', b'unknown', b'C', b'unknown', b'unknown', b'E'],
      dtype=object)>), ('embark_town', <tf.Tensor: id=166, shape=(12,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
       b'Cherbourg', b'Southampton', b'Southampton', b'Southampton',
       b'Cherbourg', b'Queenstown', b'Southampton', b'Southampton'],
      dtype=object)>), ('alone', <tf.Tensor: id=163, shape=(12,), dtype=string, numpy=
array([b'n', b'y', b'n', b'y', b'n', b'y', b'n', b'y', b'y', b'y', b'y',
       b'y'], dtype=object)>)]) 

LABELS: 
 tf.Tensor([0 0 1 0 1 0 0 0 1 0 0 0], shape=(12,), dtype=int32)

Data preprocessing

Categorical data

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

In the CSV, these options are represented as text. This text needs to be converted to numbers before the model can be trained. To facilitate that, we need to create a list of categorical columns, along with a list of the options available in each column.

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

Write a function that takes a tensor of categorical values, matches it to a list of value names, and then performs a one-hot encoding.

def process_categorical_data(data, categories):
  """Returns a one-hot encoded tensor representing categorical values."""
  
  # Remove leading ' '.
  data = tf.strings.regex_replace(data, '^ ', '')
  # Remove trailing '.'.
  data = tf.strings.regex_replace(data, r'\.$', '')
  
  # ONE HOT ENCODE
  # Reshape data from 1d (a list) to a 2d (a list of one-element lists)
  data = tf.reshape(data, [-1, 1])
  # For each element, create a new list of boolean values the length of categories,
  # where the truth value is element == category label
  data = tf.equal(categories, data)
  # Cast booleans to floats.
  data = tf.cast(data, tf.float32)
  
  # The entire encoding can fit on one line:
  # data = tf.cast(tf.equal(categories, tf.reshape(data, [-1, 1])), tf.float32)
  return data

To help you visualize this, we'll take a single category-column tensor from the first batch, preprocess it, and show the before and after state.

class_tensor = examples['class']
class_tensor
<tf.Tensor: id=164, shape=(12,), dtype=string, numpy=
array([b'First', b'Third', b'Third', b'Third', b'First', b'Third',
       b'Third', b'Third', b'First', b'Third', b'First', b'First'],
      dtype=object)>
class_categories = CATEGORIES['class']
class_categories
['First', 'Second', 'Third']
processed_class = process_categorical_data(class_tensor, class_categories)
processed_class
<tf.Tensor: id=189, shape=(12, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]], dtype=float32)>

Notice the relationship between the lengths of the two inputs and the shape of the output.

print("Size of batch: ", len(class_tensor.numpy()))
print("Number of category labels: ", len(class_categories))
print("Shape of one-hot encoded tensor: ", processed_class.shape)
Size of batch:  12
Number of category labels:  3
Shape of one-hot encoded tensor:  (12, 3)

Continuous data

Continuous data needs to be scaled so that all columns are on a comparable footing. A simple approach, used here, is to multiply each value by 1 over twice the mean of the column values; this puts typical values near 0.5, though it does not strictly bound them to [0, 1].

The function should also reshape the data into a two-dimensional tensor.

def process_continuous_data(data, mean):
  # Normalize data
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])
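
If you need values strictly bounded to [0, 1], min-max scaling is a common alternative. Here is a minimal sketch, assuming you have precomputed the column minimum and maximum (this function is illustrative only and isn't used elsewhere in this tutorial):

def process_continuous_data_minmax(data, col_min, col_max):
  # Linearly scale values so col_min maps to 0 and col_max maps to 1.
  data = (tf.cast(data, tf.float32) - col_min) / (col_max - col_min)
  return tf.reshape(data, [-1, 1])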

To do this calculation, you need the column means. In a real application you would compute these from the data, but for this example we'll just provide them.

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}
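
If you did want to compute the means yourself, one straightforward approach is pandas (a sketch; pandas isn't otherwise used in this tutorial, but it is available in Colab):

import pandas as pd

# Read the training CSV and take the mean of each continuous column.
df = pd.read_csv(train_file_path)
computed_means = {col: df[col].mean()
                  for col in ['age', 'n_siblings_spouses', 'parch', 'fare']}
print(computed_means)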

Again, to see what this function is actually doing, we'll take a single tensor of continuous data and show it before and after processing.

age_tensor = examples['age']
age_tensor
<tf.Tensor: id=162, shape=(12,), dtype=float32, numpy=
array([21., 22., 36., 28., 23., 26., 31., 18., 24., 28., 28., 47.],
      dtype=float32)>
process_continuous_data(age_tensor, MEANS['age'])
<tf.Tensor: id=198, shape=(12, 1), dtype=float32, numpy=
array([[0.354],
       [0.371],
       [0.607],
       [0.472],
       [0.388],
       [0.439],
       [0.523],
       [0.304],
       [0.405],
       [0.472],
       [0.472],
       [0.793]], dtype=float32)>

Preprocess the data

Now assemble these preprocessing tasks into a single function that can be mapped to each batch in the dataset.

def preprocess(features, labels):
  
  # Process categorical features.
  for feature in CATEGORIES.keys():
    features[feature] = process_categorical_data(features[feature],
                                                 CATEGORIES[feature])

  # Process continuous features.
  for feature in MEANS.keys():
    features[feature] = process_continuous_data(features[feature],
                                                MEANS[feature])
  
  # Assemble features into a single tensor.
  features = tf.concat([features[column] for column in FEATURE_COLUMNS], 1)
  
  return features, labels

Now apply that function with tf.data.Dataset.map, and shuffle the training dataset so the model sees the examples in a different order each epoch.

train_data = raw_train_data.map(preprocess).shuffle(500)
test_data = raw_test_data.map(preprocess)
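
The argument to shuffle is the buffer size: shuffle(500) keeps a buffer of 500 examples and samples the next element randomly from it, so a buffer at least as large as the dataset gives a full shuffle. If input-pipeline throughput matters, you could also chain .prefetch(tf.data.experimental.AUTOTUNE) onto these datasets, though it isn't needed for this small example.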

Let's see what a single batch looks like after preprocessing.

examples, labels = next(iter(train_data))

examples, labels
(<tf.Tensor: id=365, shape=(12, 24), dtype=float32, numpy=
 array([[1.   , 0.   , 0.81 , 0.917, 0.   , 0.756, 1.   , 0.   , 0.   ,
         0.   , 0.   , 1.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 1.   ],
        [0.   , 1.   , 0.472, 0.   , 0.   , 0.113, 0.   , 0.   , 1.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 1.   , 1.   , 0.   ],
        [1.   , 0.   , 0.321, 0.   , 0.   , 0.153, 0.   , 1.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 1.   , 0.   ],
        [1.   , 0.   , 0.574, 0.   , 0.   , 0.189, 0.   , 1.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 1.   , 0.   ],
        [0.   , 1.   , 0.371, 0.   , 0.   , 0.113, 0.   , 0.   , 1.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 1.   , 0.   ],
        [0.   , 1.   , 0.523, 0.   , 2.634, 2.397, 1.   , 0.   , 0.   ,
         0.   , 0.   , 1.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 1.   ],
        [1.   , 0.   , 0.337, 0.   , 0.   , 0.134, 0.   , 0.   , 1.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 1.   , 0.   ],
        [0.   , 1.   , 0.979, 0.   , 0.   , 2.131, 1.   , 0.   , 0.   ,
         0.   , 1.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 1.   , 0.   , 0.   , 1.   , 0.   ],
        [0.   , 1.   , 0.506, 2.75 , 0.   , 0.305, 0.   , 1.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 1.   ],
        [1.   , 0.   , 0.118, 3.667, 1.317, 0.577, 0.   , 0.   , 1.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 1.   ],
        [1.   , 0.   , 1.08 , 0.917, 5.269, 3.824, 1.   , 0.   , 0.   ,
         0.   , 0.   , 1.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 1.   ],
        [1.   , 0.   , 0.304, 0.   , 0.   , 0.121, 0.   , 0.   , 1.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 1.   , 0.   ]], dtype=float32)>,
 <tf.Tensor: id=366, shape=(12,), dtype=int32, numpy=array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0], dtype=int32)>)

The examples are now a two-dimensional array of shape (12, 24): 12 rows (the batch size), each representing a single row in the original CSV file, with 24 feature values per row (1 for each of the four continuous columns, plus 2 + 3 + 10 + 3 + 2 one-hot slots for sex, class, deck, embark_town, and alone). The labels are a 1D tensor of 12 values.

Build the model

This example uses the Keras Functional API wrapped in a get_model constructor to build up a simple model.

def get_model(input_dim, hidden_units=[100]):
  """Create a Keras model with layers.

  Args:
    input_dim: (int) The number of features in each input example
      (not including the batch dimension).
    hidden_units: [int] the layer sizes of the DNN (input layer first).

  Returns:
    A Keras model.
  """

  inputs = tf.keras.Input(shape=(input_dim,))
  x = inputs

  for units in hidden_units:
    x = tf.keras.layers.Dense(units, activation='relu')(x)
  outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

  model = tf.keras.Model(inputs, outputs)
 
  return model

The get_model constructor needs to know the input shape of your data (not including the batch size).

input_shape, output_shape = train_data.output_shapes

input_dimension = input_shape.dims[1] # [0] is the batch size
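
With the 24 preprocessed feature columns assembled above, input_dimension works out to 24.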

Train, evaluate, and predict

Now the model can be instantiated and trained.

model = get_model(input_dimension)
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

model.fit(train_data, epochs=20)
Epoch 1/20


53/53 [==============================] - 1s 23ms/step - loss: 0.5998 - accuracy: 0.7018
Epoch 2/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4809 - accuracy: 0.7847
Epoch 3/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4396 - accuracy: 0.8086
Epoch 4/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4239 - accuracy: 0.8150
Epoch 5/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4164 - accuracy: 0.8198
Epoch 6/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4113 - accuracy: 0.8214
Epoch 7/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4073 - accuracy: 0.8230
Epoch 8/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4037 - accuracy: 0.8293
Epoch 9/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4004 - accuracy: 0.8278
Epoch 10/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3975 - accuracy: 0.8278
Epoch 11/20
53/53 [==============================] - 0s 4ms/step - loss: 0.3949 - accuracy: 0.8309
Epoch 12/20
53/53 [==============================] - 0s 4ms/step - loss: 0.3924 - accuracy: 0.8309
Epoch 13/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3900 - accuracy: 0.8325
Epoch 14/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3878 - accuracy: 0.8325
Epoch 15/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3855 - accuracy: 0.8341
Epoch 16/20
53/53 [==============================] - 0s 4ms/step - loss: 0.3836 - accuracy: 0.8341
Epoch 17/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3816 - accuracy: 0.8389
Epoch 18/20
53/53 [==============================] - 0s 4ms/step - loss: 0.3796 - accuracy: 0.8437
Epoch 19/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3777 - accuracy: 0.8437
Epoch 20/20
53/53 [==============================] - 0s 4ms/step - loss: 0.3758 - accuracy: 0.8453

<tensorflow.python.keras.callbacks.History at 0x7fdeb08bb358>

Once the model is trained, we can check its accuracy on the test_data set.

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))
     22/Unknown - 0s 7ms/step - loss: 0.4349 - accuracy: 0.7955

Test Loss 0.4349003806710243, Test Accuracy 0.7954545617103577

Use tf.keras.Model.predict to infer labels on a batch or a dataset of batches.

predictions = model.predict(test_data)

# Show some results
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))

Predicted survival: 26.49%  | Actual outcome:  DIED
Predicted survival: 96.94%  | Actual outcome:  SURVIVED
Predicted survival: 79.50%  | Actual outcome:  SURVIVED
Predicted survival: 11.12%  | Actual outcome:  DIED
Predicted survival: 87.19%  | Actual outcome:  SURVIVED
Predicted survival: 34.67%  | Actual outcome:  DIED
Predicted survival: 7.53%  | Actual outcome:  DIED
Predicted survival: 11.30%  | Actual outcome:  DIED
Predicted survival: 58.76%  | Actual outcome:  DIED
Predicted survival: 13.14%  | Actual outcome:  DIED
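
One caveat about the pairing above: make_csv_dataset shuffles its output by default, so the order in which predict consumed the batches may not match the order produced by list(test_data) here. For a dependable prediction-to-outcome pairing, you could pass shuffle=False when constructing the test dataset, or run predictions on a single batch you've already materialized.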