
Load pandas dataframes with tf.data


This tutorial provides an example of how to load pandas dataframes into a tf.data.Dataset.

This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV contains several hundred rows. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.

Read data using pandas

from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv
16384/13273 [=====================================] - 0s 0us/step

Read the CSV file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which has dtype object in the dataframe, to discrete numerical values.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
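
To make the conversion above more concrete, here is a minimal sketch of how pd.Categorical assigns integer codes. The string labels are illustrative stand-ins rather than the actual contents of the thal column, and the variable names are hypothetical. Categories are sorted, codes follow that order, and missing values receive the code -1.

```python
import pandas as pd

# Illustrative labels only; the real `thal` column may contain other values.
thal = pd.Series(['fixed', 'normal', 'reversible', 'normal', None])

categorical = pd.Categorical(thal)

# Integer code for every row; missing values are encoded as -1.
codes = categorical.codes.tolist()

# Mapping from each code back to its original label.
code_to_label = dict(enumerate(categorical.categories))

print(codes)
print(code_to_label)
```

Keeping a mapping such as code_to_label around is one way to translate codes back to readable labels later.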

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from a pandas dataframe.

One of the advantages of using tf.data.Dataset is that it lets you write simple, highly efficient data pipelines. Read the loading data guide to find out more.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0

Since a pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use a np.array or a tf.Tensor.

tf.constant(df['thal'])
<tf.Tensor: id=31, shape=(303,), dtype=int32, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int32)>
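
The __array__ protocol mentioned above can be illustrated with NumPy alone. The sketch below is not part of the original tutorial: ColumnWrapper is a hypothetical class that exists only to show that any object exposing __array__ can be converted by np.asarray, which is the same mechanism a pd.Series relies on.

```python
import numpy as np
import pandas as pd

# A pd.Series converts directly because it implements __array__.
series = pd.Series([2, 3, 4], name='thal')
array_from_series = np.asarray(series)


class ColumnWrapper:
    """Hypothetical container, used only to demonstrate the protocol."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this method when it needs an ndarray view of the object.
        return np.array(self._values, dtype=dtype)


array_from_wrapper = np.asarray(ColumnWrapper([2, 3, 4]))

print(array_from_series)
print(array_from_wrapper)
```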

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)
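
As a rough illustration of what shuffle and batch do, the sketch below builds a tiny synthetic dataset and inspects the resulting batches. It assumes TensorFlow 2.x with eager execution; the arrays and the batch size of 4 are made up for the example and are not taken from the heart dataset.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for a feature matrix and a label vector.
features = np.arange(20, dtype=np.float32).reshape(10, 2)
labels = np.arange(10, dtype=np.int64)

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# A shuffle buffer as large as the dataset gives a full uniform shuffle.
batched = dataset.shuffle(buffer_size=len(labels)).batch(4)

# 10 examples in batches of 4: the last batch holds the 2 leftover examples.
batch_sizes = [int(y.shape[0]) for _, y in batched]

# Every label still appears exactly once, only the order changes.
seen_labels = sorted(int(v) for _, y in batched for v in y.numpy())

print(batch_sizes)
print(seen_labels)
```

With batch(1), as used in this tutorial, every step processes a single example, which is why the training log reports 303 steps per epoch.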

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
303/303 [==============================] - 3s 9ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 2/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 3/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 4/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 5/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 6/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 7/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 8/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 9/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 10/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 11/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 12/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 13/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 14/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081
Epoch 15/15
303/303 [==============================] - 1s 3ms/step - loss: 4.2253 - accuracy: 0.7081

<tensorflow.python.keras.callbacks.History at 0x7f72cfdd06d8>

Alternative to feature columns

Passing a dictionary as an input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them up using the functional API. You can use this as an alternative to feature columns.

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict, and slice that dictionary.

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print(dict_slice)
({'ca': <tf.Tensor: id=51185, shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'restecg': <tf.Tensor: id=51191, shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'exang': <tf.Tensor: id=51188, shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'cp': <tf.Tensor: id=51187, shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'thalach': <tf.Tensor: id=51195, shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'chol': <tf.Tensor: id=51186, shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: id=51189, shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'thal': <tf.Tensor: id=51194, shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>, 'trestbps': <tf.Tensor: id=51196, shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'slope': <tf.Tensor: id=51193, shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'sex': <tf.Tensor: id=51192, shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'oldpeak': <tf.Tensor: id=51190, shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'age': <tf.Tensor: id=51184, shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>}, <tf.Tensor: id=51197, shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
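
The printed structure above is easier to read on a smaller example. The following sketch assumes TensorFlow 2.x with eager execution and uses a hypothetical three-row dataframe with two columns. Each dataset element is a (features, label) pair in which features is a dictionary keyed by column name.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical miniature dataframe; the values are for illustration only.
frame = pd.DataFrame({'age': [63, 67, 37], 'oldpeak': [2.3, 1.5, 3.5]})
labels = pd.Series([0, 1, 0])

# Slicing a dict of columns keeps each column as its own named tensor.
mini_slices = tf.data.Dataset.from_tensor_slices(
    (frame.to_dict('list'), labels.values)).batch(2)

first_features, first_labels = next(iter(mini_slices))

keys = sorted(first_features.keys())
age_batch = first_features['age'].numpy().tolist()
label_batch = first_labels.numpy().tolist()

print(keys)
print(age_batch)
print(label_batch)
```

Because the keys match the column names, such a dataset can feed a model whose Input layers are named after the same columns.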
model_func.fit(dict_slices, epochs=15)
W0813 05:59:11.682256 140133270013696 training_utils.py:1436] Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.

Epoch 1/15
19/19 [==============================] - 1s 36ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 2/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 3/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 4/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 5/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 6/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 7/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 8/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 9/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 10/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 11/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 12/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 13/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 14/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261
Epoch 15/15
19/19 [==============================] - 0s 7ms/step - loss: 4.2378 - accuracy: 0.7261

<tensorflow.python.keras.callbacks.History at 0x7f728051fc88>