Load a pandas DataFrame


This tutorial provides examples of how to load pandas DataFrames into TensorFlow.

You will use a small heart disease dataset provided by the UCI Machine Learning Repository. The CSV contains 303 rows. Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease, which is a binary classification task.

Read data using pandas

import pandas as pd
import tensorflow as tf

SHUFFLE_BUFFER = 500
BATCH_SIZE = 2

Download the CSV file containing the heart disease dataset:

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/heart.csv
13273/13273 [==============================] - 0s 0us/step

Read the CSV file using pandas:

df = pd.read_csv(csv_file)

This is what the data looks like:

df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

You will build models to predict the label contained in the target column.

target = df.pop('target')

A DataFrame as an array

If your data has a uniform datatype, or dtype, it's possible to use a pandas DataFrame anywhere you could use a NumPy array. This works because the pandas.DataFrame class supports the __array__ protocol, and TensorFlow's tf.convert_to_tensor function accepts objects that support the protocol.
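
As a rough illustration of the `__array__` protocol, the sketch below defines a hypothetical `ColumnTable` class (not part of pandas or TensorFlow) that exposes its data through `__array__`. NumPy then accepts it wherever it accepts an array-like. The class name and the values are assumptions made for the example.

```python
import numpy as np


class ColumnTable:
    """Hypothetical container that implements the __array__ protocol."""

    def __init__(self, rows):
        self._rows = rows

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this method when it needs an ndarray for the object.
        return np.array(self._rows, dtype=dtype)


table = ColumnTable([[63.0, 150.0], [67.0, 108.0]])
arr = np.asarray(table)   # triggers ColumnTable.__array__
print(arr.shape)          # (2, 2)
print(arr.mean(axis=0))   # per-column means
```

pandas.DataFrame plays the role of `ColumnTable` here, which is why it can stand in for an array.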

Take the numeric features from the dataset (skip the categorical features for now):

numeric_feature_names = ['age', 'thalach', 'trestbps',  'chol', 'oldpeak']
numeric_features = df[numeric_feature_names]
numeric_features.head()

The DataFrame can be converted to a NumPy array using the DataFrame.values property or numpy.array(df). To convert it to a tensor, use tf.convert_to_tensor:

tf.convert_to_tensor(numeric_features)
<tf.Tensor: shape=(303, 5), dtype=float64, numpy=
array([[ 63. , 150. , 145. , 233. ,   2.3],
       [ 67. , 108. , 160. , 286. ,   1.5],
       [ 67. , 129. , 120. , 229. ,   2.6],
       ...,
       [ 65. , 127. , 135. , 254. ,   2.8],
       [ 48. , 150. , 130. , 256. ,   0. ],
       [ 63. , 154. , 150. , 407. ,   4. ]])>
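
The result above is float64 even though four of the five selected columns are int64. A small sketch with made-up values shows the same upcast: when a DataFrame mixes integer and float columns, converting it to a single array uses a common dtype.

```python
import numpy as np
import pandas as pd

# Toy frame: one integer column and one float column (values are illustrative).
toy = pd.DataFrame({"age": [63, 67, 67], "oldpeak": [2.3, 1.5, 2.6]})

print(toy.dtypes)            # the two columns have different dtypes
as_array = np.array(toy)     # conversion goes through DataFrame.__array__
print(as_array.dtype)        # float64: the common dtype of both columns
print(as_array.shape)        # (3, 2)
```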

In general, if an object can be converted to a tensor with tf.convert_to_tensor, it can be passed anywhere you can pass a tf.Tensor.

With Model.fit

A DataFrame, interpreted as a single tensor, can be used directly as an argument to the Model.fit method.

Below is an example of training a model on the numeric features of the dataset.

The first step is to normalize the input ranges. Use a tf.keras.layers.Normalization layer for that.

Before using the layer, call its Normalization.adapt method to set the mean and standard deviation from the data:

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(numeric_features)
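
Conceptually, after `adapt` the layer holds a per-feature mean and variance and applies `(x - mean) / sqrt(variance)` to its inputs. The NumPy sketch below reproduces that arithmetic on made-up data. It approximates the layer's behavior (for example, it ignores the small epsilon Keras uses to avoid division by zero) and is not its implementation.

```python
import numpy as np

# Toy batch: 4 examples, 2 features (values are illustrative).
x = np.array([[63.0, 233.0],
              [67.0, 286.0],
              [67.0, 229.0],
              [37.0, 250.0]])

# "adapt": compute statistics over the batch axis, one value per feature.
mean = x.mean(axis=0)
var = x.var(axis=0)  # population variance

# "call": standardize each feature column.
normalized = (x - mean) / np.sqrt(var)

print(normalized.mean(axis=0))  # approximately 0 for every feature
print(normalized.std(axis=0))   # approximately 1 for every feature
```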

Call the layer on the first three rows of the DataFrame to visualize an example of the output from this layer:

normalizer(numeric_features.iloc[:3])
<tf.Tensor: shape=(3, 5), dtype=float32, numpy=
array([[ 0.93383914,  0.03480718,  0.74578077, -0.26008663,  1.0680453 ],
       [ 1.3782105 , -1.7806165 ,  1.5923285 ,  0.7573877 ,  0.38022864],
       [ 1.3782105 , -0.87290466, -0.6651321 , -0.33687714,  1.3259765 ]],
      dtype=float32)>

Use the normalization layer as the first layer of a simple model:

def get_basic_model():
  model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])
  return model

When you pass the DataFrame as the x argument to Model.fit, Keras treats the DataFrame as it would a NumPy array:

model = get_basic_model()
model.fit(numeric_features, target, epochs=15, batch_size=BATCH_SIZE)
Epoch 1/15
152/152 [==============================] - 2s 3ms/step - loss: 0.6300 - accuracy: 0.7261
Epoch 2/15
152/152 [==============================] - 0s 3ms/step - loss: 0.5773 - accuracy: 0.7261
Epoch 3/15
152/152 [==============================] - 0s 3ms/step - loss: 0.5332 - accuracy: 0.7327
Epoch 4/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4982 - accuracy: 0.7393
Epoch 5/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4755 - accuracy: 0.7624
Epoch 6/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4588 - accuracy: 0.7723
Epoch 7/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4503 - accuracy: 0.7822
Epoch 8/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4428 - accuracy: 0.7789
Epoch 9/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4367 - accuracy: 0.7822
Epoch 10/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4327 - accuracy: 0.7888
Epoch 11/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4294 - accuracy: 0.8020
Epoch 12/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4259 - accuracy: 0.7888
Epoch 13/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4226 - accuracy: 0.7921
Epoch 14/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4228 - accuracy: 0.8053
Epoch 15/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4206 - accuracy: 0.7954
<keras.src.callbacks.History at 0x7f04f473b550>
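
The `152/152` shown on each progress bar is the number of batches per epoch. It follows from the 303 rows reported in the tensor shape earlier and `BATCH_SIZE = 2`, with the last batch holding the single leftover row:

```python
import math

num_rows = 303   # from the (303, 5) tensor shape shown earlier
batch_size = 2   # BATCH_SIZE in this tutorial

steps_per_epoch = math.ceil(num_rows / batch_size)
last_batch_size = num_rows - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 152
print(last_batch_size)   # 1
```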

With tf.data

If you want to apply tf.data transformations to a DataFrame of a uniform dtype, the Dataset.from_tensor_slices method will create a dataset that iterates over the rows of the DataFrame. Each row is initially a vector of values. To train a model, you need (inputs, labels) pairs, so pass (features, labels) and Dataset.from_tensor_slices will return the needed pairs of slices:

numeric_dataset = tf.data.Dataset.from_tensor_slices((numeric_features, target))

for row in numeric_dataset.take(3):
  print(row)
(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 63. , 150. , 145. , 233. ,   2.3])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 67. , 108. , 160. , 286. ,   1.5])>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 67. , 129. , 120. , 229. ,   2.6])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
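
Conceptually, slicing a `(features, labels)` tuple pairs row `i` of the features with element `i` of the labels. A plain NumPy sketch of that pairing, on made-up values, is shown below. It mimics the behavior and is not how `tf.data` is implemented.

```python
import numpy as np

features = np.array([[63.0, 150.0], [67.0, 108.0], [67.0, 129.0]])
labels = np.array([0, 1, 0])

# Both arrays must agree on the first (row) dimension.
assert features.shape[0] == labels.shape[0]

# Each element is a (feature_vector, label) pair, like the printed rows above.
pairs = list(zip(features, labels))
for feature_vector, label in pairs:
    print(feature_vector, label)
```
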
numeric_batches = numeric_dataset.shuffle(1000).batch(BATCH_SIZE)

model = get_basic_model()
model.fit(numeric_batches, epochs=15)
Epoch 1/15
152/152 [==============================] - 1s 3ms/step - loss: 0.6518 - accuracy: 0.7030
Epoch 2/15
152/152 [==============================] - 0s 3ms/step - loss: 0.5140 - accuracy: 0.7261
Epoch 3/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4753 - accuracy: 0.7294
Epoch 4/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4586 - accuracy: 0.7261
Epoch 5/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4498 - accuracy: 0.7393
Epoch 6/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4430 - accuracy: 0.7459
Epoch 7/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4385 - accuracy: 0.7558
Epoch 8/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4354 - accuracy: 0.7624
Epoch 9/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4323 - accuracy: 0.7756
Epoch 10/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4289 - accuracy: 0.7789
Epoch 11/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4276 - accuracy: 0.7789
Epoch 12/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4248 - accuracy: 0.7855
Epoch 13/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4227 - accuracy: 0.7789
Epoch 14/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4208 - accuracy: 0.7921
Epoch 15/15
152/152 [==============================] - 0s 3ms/step - loss: 0.4182 - accuracy: 0.7921
<keras.src.callbacks.History at 0x7f04f41e0550>

A DataFrame as a dictionary

When you start dealing with heterogeneous data, it is no longer possible to treat the DataFrame as if it were a single array. TensorFlow tensors require that all elements have the same dtype.

So, in this case, you need to start treating it as a dictionary of columns, where each column has a uniform dtype. A DataFrame is a lot like a dictionary of arrays, so typically all you need to do is cast the DataFrame to a Python dict. Many important TensorFlow APIs support (nested-)dictionaries of arrays as inputs.
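
As a small illustration with made-up values, casting a DataFrame to a dict yields a mapping from column name to column, and each column keeps its own dtype:

```python
import pandas as pd

# Toy frame with one numeric and one string column (values are illustrative).
toy = pd.DataFrame({"age": [63, 67], "thal": ["fixed", "normal"]})

columns = dict(toy)  # maps column name -> pandas Series

for name, series in columns.items():
    print(name, series.dtype)   # each column keeps its own dtype
```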

tf.data input pipelines handle this quite well. All tf.data operations handle dictionaries and tuples automatically. So, to make a dataset of dictionary-examples from a DataFrame, just cast it to a dict before slicing it with Dataset.from_tensor_slices:

numeric_dict_ds = tf.data.Dataset.from_tensor_slices((dict(numeric_features), target))

Here are the first three examples from that dataset:

for row in numeric_dict_ds.take(3):
  print(row)
({'age': <tf.Tensor: shape=(), dtype=int64, numpy=63>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=150>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=145>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=233>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=2.3>}, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
({'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=108>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=160>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=286>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=1.5>}, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
({'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=129>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=120>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=229>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=2.6>}, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

Dictionaries with Keras

Typically, Keras models and layers expect a single input tensor, but these classes can accept and return nested structures of dictionaries, tuples and tensors. These structures are known as "nests" (refer to the tf.nest module for details).
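
A brief sketch of what working with a nest looks like, using two functions from the tf.nest module on a made-up structure:

```python
import tensorflow as tf

# A nested structure: a dict holding a tensor and a tuple of tensors.
structure = {
    "b": tf.constant(1.0),
    "a": (tf.constant(2.0), tf.constant(3.0)),
}

# flatten returns the leaves in a deterministic order (dict keys are sorted).
leaves = tf.nest.flatten(structure)
print([float(leaf) for leaf in leaves])   # [2.0, 3.0, 1.0]

# map_structure applies a function to every leaf and keeps the nesting.
doubled = tf.nest.map_structure(lambda t: t * 2, structure)
print(float(doubled["b"]))                # 2.0
```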

There are two equivalent ways you can write a Keras model that accepts a dictionary as input.

1. The Model-subclass style

In this style, you write a subclass of tf.keras.Model (or tf.keras.layers.Layer) and handle the inputs and create the outputs directly. The helper below stacks a dictionary of columns into a single tensor:

def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
      values.append(tf.cast(inputs[key], tf.float32))

    return fun(values, axis=-1)
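
A minimal sketch of such a subclass is shown below. The class name `DictInputModel`, the variable names, and the toy data are assumptions made for this example. The layer stack mirrors `get_basic_model`, and the helper is repeated so the block runs on its own.

```python
import numpy as np
import tensorflow as tf


def stack_dict(inputs, fun=tf.stack):
    # Same helper as above: cast each column to float32 and combine the
    # columns in sorted key order so the feature order is deterministic.
    values = [tf.cast(inputs[key], tf.float32) for key in sorted(inputs.keys())]
    return fun(values, axis=-1)


class DictInputModel(tf.keras.Model):  # hypothetical name
    def __init__(self):
        super().__init__()
        # Create all internal layers once, in the constructor.
        self.normalizer = tf.keras.layers.Normalization(axis=-1)
        self.seq = tf.keras.Sequential([
            self.normalizer,
            tf.keras.layers.Dense(10, activation='relu'),
            tf.keras.layers.Dense(10, activation='relu'),
            tf.keras.layers.Dense(1),
        ])

    def adapt(self, inputs):
        # Stack the dict of columns, then fit the normalization statistics.
        self.normalizer.adapt(stack_dict(inputs))

    def call(self, inputs):
        # Stack the dict of columns into one (batch, features) tensor
        # and run it through the layers.
        return self.seq(stack_dict(inputs))


# Toy dict of columns (values are illustrative).
toy_columns = {
    'age': np.array([63, 67, 67, 37]),
    'oldpeak': np.array([2.3, 1.5, 2.6, 3.5]),
}

sketch_model = DictInputModel()
sketch_model.adapt(toy_columns)
logits = sketch_model(toy_columns)
print(logits.shape)   # (4, 1)
```

With the tutorial's data, the analogous steps would be to call `adapt(dict(numeric_features))` and then compile the model with the same optimizer, loss, and metrics as `get_basic_model` before calling `fit`.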

This model can accept either a dictionary of columns or a dataset of dictionary-elements for training:

model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)
Epoch 1/5
152/152 [==============================] - 4s 24ms/step - loss: 0.6301 - accuracy: 0.7294
Epoch 2/5
152/152 [==============================] - 4s 24ms/step - loss: 0.5304 - accuracy: 0.7393
Epoch 3/5
152/152 [==============================] - 4s 25ms/step - loss: 0.4706 - accuracy: 0.7657
Epoch 4/5
152/152 [==============================] - 4s 24ms/step - loss: 0.4451 - accuracy: 0.7954
Epoch 5/5
152/152 [==============================] - 4s 24ms/step - loss: 0.4370 - accuracy: 0.7855
<keras.src.callbacks.History at 0x7f04f401e910>
numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)
model.fit(numeric_dict_batches, epochs=5)
Epoch 1/5
152/152 [==============================] - 3s 21ms/step - loss: 0.4304 - accuracy: 0.7987
Epoch 2/5
152/152 [==============================] - 3s 22ms/step - loss: 0.4258 - accuracy: 0.7987
Epoch 3/5
152/152 [==============================] - 3s 22ms/step - loss: 0.4249 - accuracy: 0.7987
Epoch 4/5
152/152 [==============================] - 3s 22ms/step - loss: 0.4224 - accuracy: 0.7987
Epoch 5/5
152/152 [==============================] - 3s 22ms/step - loss: 0.4200 - accuracy: 0.8020
<keras.src.callbacks.History at 0x7f04ec482550>

Here are the predictions for the first three examples:

model.predict(dict(numeric_features.iloc[:3]))
1/1 [==============================] - 0s 55ms/step
array([[[0.00656018]],

       [[0.6756177 ]],

       [[0.30398777]]], dtype=float32)
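
If this model ends in a `Dense(1)` layer with no activation and is trained with `from_logits=True`, as the other models in this tutorial are, the values above are logits rather than probabilities. Under that assumption, a sigmoid converts them. The sketch below uses plain Python on the three printed values:

```python
import math


def sigmoid(logit):
    # Map a real-valued logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


logits = [0.00656018, 0.6756177, 0.30398777]   # values printed above
probabilities = [sigmoid(z) for z in logits]

for z, p in zip(logits, probabilities):
    print(f"logit={z:.4f} -> probability={p:.4f}")
```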

2. The Keras functional style

inputs = {}
for name, column in numeric_features.items():
  inputs[name] = tf.keras.Input(
      shape=(1,), name=name, dtype=tf.float32)

inputs
{'age': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'age')>,
 'thalach': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'thalach')>,
 'trestbps': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'trestbps')>,
 'chol': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'chol')>,
 'oldpeak': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'oldpeak')>}
x = stack_dict(inputs, fun=tf.concat)

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(stack_dict(dict(numeric_features)))

x = normalizer(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)
x = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(inputs, x)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'],
              run_eagerly=True)
tf.keras.utils.plot_model(model, rankdir="LR", show_shapes=True)

[Figure: diagram of the functional model produced by tf.keras.utils.plot_model]

You can train the functional model the same way as the model subclass:

model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)
Epoch 1/5
152/152 [==============================] - 4s 23ms/step - loss: 0.6733 - accuracy: 0.7294
Epoch 2/5
152/152 [==============================] - 3s 23ms/step - loss: 0.5835 - accuracy: 0.7558
Epoch 3/5
152/152 [==============================] - 3s 23ms/step - loss: 0.5253 - accuracy: 0.7591
Epoch 4/5
152/152 [==============================] - 3s 23ms/step - loss: 0.4866 - accuracy: 0.7558
Epoch 5/5
152/152 [==============================] - 3s 22ms/step - loss: 0.4624 - accuracy: 0.7723
<keras.src.callbacks.History at 0x7f067d201340>
numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)
model.fit(numeric_dict_batches, epochs=5)
Epoch 1/5
152/152 [==============================] - 4s 23ms/step - loss: 0.4491 - accuracy: 0.7624
Epoch 2/5
152/152 [==============================] - 4s 23ms/step - loss: 0.4422 - accuracy: 0.7855
Epoch 3/5
152/152 [==============================] - 4s 23ms/step - loss: 0.4370 - accuracy: 0.7888
Epoch 4/5
152/152 [==============================] - 4s 23ms/step - loss: 0.4328 - accuracy: 0.7855
Epoch 5/5
152/152 [==============================] - 4s 24ms/step - loss: 0.4308 - accuracy: 0.7888
<keras.src.callbacks.History at 0x7f04ec482160>

Full example

If you're passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So, the best approach is to build the preprocessing into the model. Keras preprocessing layers cover many common tasks.

Build the preprocessing head

In this dataset, some of the "integer" features in the raw data are actually categorical indices. These indices are not really ordered numeric values (refer to the dataset description for details). Because they are unordered, they are inappropriate to feed directly to the model; the model would interpret them as being ordered. To use these inputs, you'll need to encode them, either as one-hot vectors or embedding vectors. The same applies to string-categorical features.

Binary features, on the other hand, do not generally need to be encoded or normalized.

Start by creating a list of the features that fall into each group:

binary_feature_names = ['sex', 'fbs', 'exang']
categorical_feature_names = ['cp', 'restecg', 'slope', 'thal', 'ca']

The next step is to build a preprocessing model that will apply appropriate preprocessing to each input and concatenate the results.

This section uses the Keras Functional API to implement the preprocessing. You start by creating one tf.keras.Input for each column of the dataframe:

inputs = {}
for name, column in df.items():
  if isinstance(column.iloc[0], str):
    dtype = tf.string
  elif (name in categorical_feature_names or
        name in binary_feature_names):
    dtype = tf.int64
  else:
    dtype = tf.float32

  inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)
inputs
{'age': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'age')>,
 'sex': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'sex')>,
 'cp': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'cp')>,
 'trestbps': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'trestbps')>,
 'chol': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'chol')>,
 'fbs': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'fbs')>,
 'restecg': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'restecg')>,
 'thalach': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'thalach')>,
 'exang': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'exang')>,
 'oldpeak': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'oldpeak')>,
 'slope': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'slope')>,
 'ca': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'ca')>,
 'thal': <KerasTensor: shape=(None,) dtype=string (created by layer 'thal')>}

For each input you'll apply some transformations using Keras layers and TensorFlow ops. Each feature starts as a batch of scalars (shape=(batch,)). The output for each should be a batch of tf.float32 vectors (shape=(batch, n)). The last step will concatenate all those vectors together.
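
The shape change from `(batch,)` to `(batch, n)` can be sketched with NumPy, where indexing with `np.newaxis` behaves like the `tf.newaxis` indexing used in this section. The values are made up:

```python
import numpy as np

batch_of_scalars = np.array([1, 0, 1])               # shape (3,)
batch_of_vectors = batch_of_scalars[:, np.newaxis]   # shape (3, 1)

print(batch_of_scalars.shape)   # (3,)
print(batch_of_vectors.shape)   # (3, 1)

# Once every feature has shape (batch, n), they can be joined on the last axis.
other = np.zeros((3, 4))
combined = np.concatenate([batch_of_vectors, other], axis=-1)
print(combined.shape)           # (3, 5)
```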

Binary inputs

Since the binary inputs don't need any preprocessing, just add the vector axis, cast them to float32 and add them to the list of preprocessed inputs:

preprocessed = []

for name in binary_feature_names:
  inp = inputs[name]
  inp = inp[:, tf.newaxis]
  float_value = tf.cast(inp, tf.float32)
  preprocessed.append(float_value)

preprocessed
[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_5')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_6')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_7')>]

Numeric inputs

As in the earlier section, you'll want to run these numeric inputs through a tf.keras.layers.Normalization layer before using them. The difference is that this time they're input as a dict. The code below collects the numeric features from the DataFrame, stacks them together, and passes the result to the Normalization.adapt method.

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(stack_dict(dict(numeric_features)))

The code below stacks the numeric features and runs them through the normalization layer.

numeric_inputs = {}
for name in numeric_feature_names:
  numeric_inputs[name]=inputs[name]

numeric_inputs = stack_dict(numeric_inputs)
numeric_normalized = normalizer(numeric_inputs)

preprocessed.append(numeric_normalized)

preprocessed
[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_5')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_6')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_7')>,
 <KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'normalization_3')>]

Categorical features

To use categorical features you'll first need to encode them into either binary vectors or embeddings. Since these features only contain a small number of categories, convert the inputs directly to one-hot vectors using the output_mode='one_hot' option, supported by both the tf.keras.layers.StringLookup and tf.keras.layers.IntegerLookup layers.

Here is an example of how these layers work:

vocab = ['a','b','c']
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
lookup(['c','a','a','b','zzz'])
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.]], dtype=float32)>
vocab = [1,4,7,99]
lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')

lookup([-1,4,1])
<tf.Tensor: shape=(3, 5), dtype=float32, numpy=
array([[1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]], dtype=float32)>
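
In both outputs the one-hot width is the vocabulary size plus one, because index 0 is reserved for out-of-vocabulary values such as `'zzz'` and `-1`. The NumPy sketch below mimics that indexing with a hypothetical helper, `one_hot_lookup`. It illustrates the mapping only and is not how the Keras layers are implemented.

```python
import numpy as np


def one_hot_lookup(values, vocab):
    """Hypothetical helper mimicking output_mode='one_hot' with one OOV slot."""
    # Index 0 is reserved for out-of-vocabulary values; vocab items start at 1.
    index = {token: i + 1 for i, token in enumerate(vocab)}
    ids = [index.get(v, 0) for v in values]
    return np.eye(len(vocab) + 1, dtype=np.float32)[ids]


encoded = one_hot_lookup(['c', 'a', 'a', 'b', 'zzz'], ['a', 'b', 'c'])
print(encoded.shape)   # (5, 4)
print(encoded)
```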

Determine the vocabulary for each categorical input from the data, then create a lookup layer that converts values from that vocabulary to one-hot vectors:

for name in categorical_feature_names:
  vocab = sorted(set(df[name]))
  print(f'name: {name}')
  print(f'vocab: {vocab}\n')

  if isinstance(vocab[0], str):
    lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
  else:
    lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')

  x = inputs[name][:, tf.newaxis]
  x = lookup(x)
  preprocessed.append(x)
name: cp
vocab: [0, 1, 2, 3, 4]

name: restecg
vocab: [0, 1, 2]

name: slope
vocab: [1, 2, 3]

name: thal
vocab: ['1', '2', 'fixed', 'normal', 'reversible']

name: ca
vocab: [0, 1, 2, 3]

Assemble the preprocessing head

At this point, preprocessed is just a Python list of all the preprocessing results. Each result has a shape of (batch_size, depth):

preprocessed
[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_5')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_6')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_7')>,
 <KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'normalization_3')>,
 <KerasTensor: shape=(None, 6) dtype=float32 (created by layer 'integer_lookup_1')>,
 <KerasTensor: shape=(None, 4) dtype=float32 (created by layer 'integer_lookup_2')>,
 <KerasTensor: shape=(None, 4) dtype=float32 (created by layer 'integer_lookup_3')>,
 <KerasTensor: shape=(None, 6) dtype=float32 (created by layer 'string_lookup_1')>,
 <KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'integer_lookup_4')>]

Concatenate all the preprocessed features along the depth axis, so each dictionary-example is converted into a single vector. The vector contains binary features, numeric features, and categorical one-hot features:

preprocessed_result = tf.concat(preprocessed, axis=-1)
preprocessed_result
<KerasTensor: shape=(None, 33) dtype=float32 (created by layer 'tf.concat_1')>
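
The depth of 33 is the sum of the widths in the `preprocessed` list printed earlier. Each one-hot width is the vocabulary size plus one slot for out-of-vocabulary values (for example, `cp` has 5 vocabulary entries and a width of 6):

```python
# Widths read from the `preprocessed` list printed above.
widths = {
    "binary (sex, fbs, exang)": 1 + 1 + 1,
    "normalized numeric": 5,
    "cp": 6,
    "restecg": 4,
    "slope": 4,
    "thal": 6,
    "ca": 5,
}

total = sum(widths.values())
print(total)   # 33
```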

Now create a model out of that calculation so it can be reused:

preprocessor = tf.keras.Model(inputs, preprocessed_result)
tf.keras.utils.plot_model(preprocessor, rankdir="LR", show_shapes=True)

[Figure: diagram of the preprocessing model produced by tf.keras.utils.plot_model]

To test the preprocessor, use the DataFrame.iloc accessor to slice the first example from the DataFrame. Then convert it to a dictionary and pass the dictionary to the preprocessor. The result is a single vector containing the binary features, normalized numeric features and the one-hot categorical features, in that order:

preprocessor(dict(df.iloc[:1]))
<tf.Tensor: shape=(1, 33), dtype=float32, numpy=
array([[ 1.        ,  1.        ,  0.        ,  0.93383914, -0.26008663,
         1.0680453 ,  0.03480718,  0.74578077,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ,  0.        ]], dtype=float32)>

Create and train a model

Now build the main body of the model. Use the same configuration as in the previous example: a couple of Dense rectified-linear layers and a Dense(1) output layer for the classification.

body = tf.keras.Sequential([
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(1)
])

Now put the two pieces together using the Keras functional API.

inputs
{'age': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'age')>,
 'sex': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'sex')>,
 'cp': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'cp')>,
 'trestbps': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'trestbps')>,
 'chol': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'chol')>,
 'fbs': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'fbs')>,
 'restecg': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'restecg')>,
 'thalach': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'thalach')>,
 'exang': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'exang')>,
 'oldpeak': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'oldpeak')>,
 'slope': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'slope')>,
 'ca': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'ca')>,
 'thal': <KerasTensor: shape=(None,) dtype=string (created by layer 'thal')>}
x = preprocessor(inputs)
x
<KerasTensor: shape=(None, 33) dtype=float32 (created by layer 'model_1')>
result = body(x)
result
<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'sequential_3')>
model = tf.keras.Model(inputs, result)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

This model expects a dictionary of inputs. The simplest way to pass it the data is to convert the DataFrame to a dict and pass that dict as the x argument to Model.fit:

history = model.fit(dict(df), target, epochs=5, batch_size=BATCH_SIZE)
Epoch 1/5
152/152 [==============================] - 2s 4ms/step - loss: 0.6555 - accuracy: 0.6964
Epoch 2/5
152/152 [==============================] - 1s 4ms/step - loss: 0.4806 - accuracy: 0.7261
Epoch 3/5
152/152 [==============================] - 1s 4ms/step - loss: 0.4001 - accuracy: 0.7360
Epoch 4/5
152/152 [==============================] - 1s 4ms/step - loss: 0.3482 - accuracy: 0.7591
Epoch 5/5
152/152 [==============================] - 1s 4ms/step - loss: 0.3182 - accuracy: 0.8119

Using tf.data works as well:

ds = tf.data.Dataset.from_tensor_slices((
    dict(df),
    target
))

ds = ds.batch(BATCH_SIZE)
import pprint

for x, y in ds.take(1):
  pprint.pprint(x)
  print()
  print(y)
{'age': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([63, 67])>,
 'ca': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 3])>,
 'chol': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([233, 286])>,
 'cp': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 4])>,
 'exang': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>,
 'fbs': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 0])>,
 'oldpeak': <tf.Tensor: shape=(2,), dtype=float64, numpy=array([2.3, 1.5])>,
 'restecg': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([2, 2])>,
 'sex': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 1])>,
 'slope': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([3, 2])>,
 'thal': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'fixed', b'normal'], dtype=object)>,
 'thalach': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([150, 108])>,
 'trestbps': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([145, 160])>}

tf.Tensor([0 1], shape=(2,), dtype=int64)
history = model.fit(ds, epochs=5)
Epoch 1/5
152/152 [==============================] - 1s 4ms/step - loss: 0.2986 - accuracy: 0.8350
Epoch 2/5
152/152 [==============================] - 1s 4ms/step - loss: 0.2876 - accuracy: 0.8548
Epoch 3/5
152/152 [==============================] - 1s 4ms/step - loss: 0.2779 - accuracy: 0.8548
Epoch 4/5
152/152 [==============================] - 1s 4ms/step - loss: 0.2700 - accuracy: 0.8581
Epoch 5/5
152/152 [==============================] - 1s 4ms/step - loss: 0.2626 - accuracy: 0.8614