
Load a pandas.DataFrame


This tutorial provides an example of how to load pandas dataframes into a tf.data.Dataset.

This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV contains several hundred rows. Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.

Read data using pandas

import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv
16384/13273 [=====================================] - 0s 0us/step

Read the CSV file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which is an object in the dataframe, to a discrete numerical value.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
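If you want to see which integer code pd.Categorical assigned to each category, the mapping is available on the Categorical object itself. A minimal sketch with a toy column (the values here are illustrative, not taken from the heart dataset):

```python
import pandas as pd

# Toy example: an object column holding string categories.
s = pd.Series(['fixed', 'normal', 'reversible', 'normal'])
cat = pd.Categorical(s)

# Categories are the sorted unique values; codes index into them.
print(list(cat.categories))  # ['fixed', 'normal', 'reversible']
print(list(cat.codes))       # [0, 1, 2, 1]
```

Keeping the categories list around is useful if you later need to map predictions or codes back to the original labels.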

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from a pandas dataframe.

One of the advantages of using tf.data.Dataset is that it allows you to write simple, highly efficient data pipelines. Read the loading data guide to find out more.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0

Since a pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use a np.array or a tf.Tensor.
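The same interoperability holds for NumPy, since np.asarray consumes the __array__ protocol directly. A quick illustration with toy values (not the dataset):

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4, 3])

# np.asarray calls the Series' __array__ method under the hood,
# returning a plain ndarray view of the data.
arr = np.asarray(s)
print(arr.tolist())  # [2, 3, 4, 3]
```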

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int32, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int32)>

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Train for 303 steps
Epoch 1/15
303/303 [==============================] - 1s 4ms/step - loss: 0.6813 - accuracy: 0.7129
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5775 - accuracy: 0.7294
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5641 - accuracy: 0.7096
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5539 - accuracy: 0.7228
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5548 - accuracy: 0.7327
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5309 - accuracy: 0.7525
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5212 - accuracy: 0.7459
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5269 - accuracy: 0.7294
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5145 - accuracy: 0.7360
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5100 - accuracy: 0.7492
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5172 - accuracy: 0.7492
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5106 - accuracy: 0.7393
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4938 - accuracy: 0.7426
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4861 - accuracy: 0.7294
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4809 - accuracy: 0.7459

<tensorflow.python.keras.callbacks.History at 0x7f23c0335e48>
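A note on the loss: the final Dense(1) layer has no activation, so the model outputs raw logits, and from_logits=True tells BinaryCrossentropy to apply the sigmoid internally, which is more numerically stable than applying it yourself. A sketch of that computation in plain NumPy, for a single logit z and binary label y (the helper name is ours, not a TensorFlow API):

```python
import numpy as np

def bce_from_logits(z, y):
    # Numerically stable binary cross-entropy on a raw logit z:
    # max(z, 0) - z*y + log(1 + exp(-|z|))
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

# For y == 1 this equals -log(sigmoid(z)):
z = 2.0
manual = -np.log(1.0 / (1.0 + np.exp(-z)))
print(np.isclose(bce_from_logits(z, 1.0), manual))  # True
```

The stable form avoids computing exp(z) for large positive logits, where the naive -log(sigmoid(z)) would overflow.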

Alternative to feature columns

Passing a dictionary as an input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them up using the functional API. You can use this as an alternative to feature columns.

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1)(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                   metrics=['accuracy'])

The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict, and slice that dictionary.
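For reference, DataFrame.to_dict('list') produces a plain dict mapping each column name to a list of that column's values, which is exactly the column-oriented structure from_tensor_slices then slices row by row into per-example dicts. A toy sketch (illustrative values):

```python
import pandas as pd

df_toy = pd.DataFrame({'age': [63, 67], 'sex': [1, 1]})
d = df_toy.to_dict('list')
print(d)  # {'age': [63, 67], 'sex': [1, 1]}

# from_tensor_slices would yield one dict per row; the first
# element would correspond to {'age': 63, 'sex': 1}.
```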

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print (dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
model_func.fit(dict_slices, epochs=15)
Train for 19 steps
Epoch 1/15
19/19 [==============================] - 0s 18ms/step - loss: 4.9917 - accuracy: 0.6700
Epoch 2/15
19/19 [==============================] - 0s 3ms/step - loss: 4.5646 - accuracy: 0.5941
Epoch 3/15
19/19 [==============================] - 0s 4ms/step - loss: 4.0047 - accuracy: 0.6139
Epoch 4/15
19/19 [==============================] - 0s 4ms/step - loss: 3.5297 - accuracy: 0.5974
Epoch 5/15
19/19 [==============================] - 0s 4ms/step - loss: 2.7839 - accuracy: 0.6172
Epoch 6/15
19/19 [==============================] - 0s 4ms/step - loss: 2.1067 - accuracy: 0.6502
Epoch 7/15
19/19 [==============================] - 0s 4ms/step - loss: 1.5788 - accuracy: 0.6601
Epoch 8/15
19/19 [==============================] - 0s 4ms/step - loss: 1.1459 - accuracy: 0.6733
Epoch 9/15
19/19 [==============================] - 0s 4ms/step - loss: 0.8593 - accuracy: 0.6898
Epoch 10/15
19/19 [==============================] - 0s 3ms/step - loss: 0.7095 - accuracy: 0.7426
Epoch 11/15
19/19 [==============================] - 0s 4ms/step - loss: 0.6168 - accuracy: 0.7657
Epoch 12/15
19/19 [==============================] - 0s 3ms/step - loss: 0.5517 - accuracy: 0.7624
Epoch 13/15
19/19 [==============================] - 0s 3ms/step - loss: 0.5121 - accuracy: 0.7789
Epoch 14/15
19/19 [==============================] - 0s 4ms/step - loss: 0.4853 - accuracy: 0.7888
Epoch 15/15
19/19 [==============================] - 0s 3ms/step - loss: 0.4644 - accuracy: 0.7888

<tensorflow.python.keras.callbacks.History at 0x7f23c019a358>