![]() |
![]() |
![]() |
![]() |
This tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). We will use Keras to define the model, and tf.feature_column
as a bridge to map from columns in a CSV to features used to train the model. This tutorial contains complete code to:
- Load a CSV file using Pandas.
- Build an input pipeline to batch and shuffle the rows using tf.data.
- Map from columns in the CSV to features used to train the model using feature columns.
- Build, train, and evaluate a model using Keras.
The Dataset
We will use a simplified version of the PetFinder dataset. There are several thousand rows in the CSV. Each row describes a pet, and each column describes an attribute. We will use this information to predict the speed at which the pet will be adopted.
Following is a description of this dataset. Notice there are both numeric and categorical columns. There is a free text column which we will not use in this tutorial.
Column | Description | Feature Type | Data Type |
---|---|---|---|
Type | Type of animal (Dog, Cat) | Categorical | string |
Age | Age of the pet | Numerical | integer |
Breed1 | Primary breed of the pet | Categorical | string |
Color1 | Color 1 of pet | Categorical | string |
Color2 | Color 2 of pet | Categorical | string |
MaturitySize | Size at maturity | Categorical | string |
FurLength | Fur length | Categorical | string |
Vaccinated | Pet has been vaccinated | Categorical | string |
Sterilized | Pet has been sterilized | Categorical | string |
Health | Health Condition | Categorical | string |
Fee | Adoption Fee | Numerical | integer |
Description | Profile write-up for this pet | Text | string |
PhotoAmt | Total uploaded photos for this pet | Numerical | integer |
AdoptionSpeed | Speed of adoption | Classification | integer |
Import TensorFlow and other libraries
pip install sklearn
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
2022-12-14 03:17:00.597681: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory 2022-12-14 03:17:00.597783: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory 2022-12-14 03:17:00.597793: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Use Pandas to create a dataframe
Pandas is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to download the dataset from a URL, and load it into a dataframe.
import pathlib
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
extract=True, cache_dir='.')
dataframe = pd.read_csv(csv_file)
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip 1668792/1668792 [==============================] - 0s 0us/step
dataframe.head()
Create target variable
The task in the original dataset is to predict the speed at which a pet will be adopted (e.g., in the first week, the first month, the first three months, and so on). Let's simplify this for our tutorial. Here, we will transform this into a binary classification problem, and simply predict whether the pet was adopted, or not.
After modifying the label column, 0 will indicate the pet was not adopted, and 1 will indicate it was.
# In the original dataset "4" indicates the pet was not adopted.
dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)
# Drop un-used columns.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])
Split the dataframe into train, validation, and test
The dataset we downloaded was a single CSV file. We will split this into train, validation, and test sets.
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
7383 train examples 1846 validation examples 2308 test examples
Create an input pipeline using tf.data
Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
dataframe = dataframe.copy()
labels = dataframe.pop('target')
ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
if shuffle:
ds = ds.shuffle(buffer_size=len(dataframe))
ds = ds.batch(batch_size)
return ds
batch_size = 5 # A small batch sized is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
Understand the input pipeline
Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.
for feature_batch, label_batch in train_ds.take(1):
print('Every feature:', list(feature_batch.keys()))
print('A batch of ages:', feature_batch['Age'])
print('A batch of targets:', label_batch )
Every feature: ['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'PhotoAmt'] A batch of ages: tf.Tensor([ 3 5 6 48 60], shape=(5,), dtype=int64) A batch of targets: tf.Tensor([1 1 0 0 0], shape=(5,), dtype=int64)
We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
Demonstrate several types of feature columns
TensorFlow provides many types of feature columns. In this section, we will create several types of feature columns, and demonstrate how they transform a column from the dataframe.
# We will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_ds))[0]
# A utility method to create a feature column
# and to transform a batch of data
def demo(feature_column):
feature_layer = layers.DenseFeatures(feature_column)
print(feature_layer(example_batch).numpy())
Numeric columns
The output of a feature column becomes the input to the model (using the demo function defined above, we will be able to see exactly how each column from the dataframe is transformed). A numeric column is the simplest type of column. It is used to represent real valued features. When using this column, your model will receive the column value from the dataframe unchanged.
photo_count = feature_column.numeric_column('PhotoAmt')
demo(photo_count)
[[2.] [3.] [2.] [1.] [2.]]
In the PetFinder dataset, most columns from the dataframe are categorical.
Bucketized columns
Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age into several buckets using a bucketized column. Notice the one-hot values below describe which age range each row matches.
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 3, 5])
demo(age_buckets)
[[0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 1. 0. 0.] [0. 1. 0. 0.]]
Categorical columns
In this dataset, Type is represented as a string (e.g. 'Dog', or 'Cat'). We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like you have seen above with age buckets). The vocabulary can be passed as a list using categorical_column_with_vocabulary_list, or loaded from a file using categorical_column_with_vocabulary_file.
animal_type = feature_column.categorical_column_with_vocabulary_list(
'Type', ['Cat', 'Dog'])
animal_type_one_hot = feature_column.indicator_column(animal_type)
demo(animal_type_one_hot)
[[0. 1.] [1. 0.] [0. 1.] [0. 1.] [0. 1.]]
Embedding columns
Suppose instead of having just a few possible strings, we have thousands (or more) values per category. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8, in the example below) is a parameter that must be tuned.
# Notice the input to the embedding column is the categorical column
# we previously created
breed1 = feature_column.categorical_column_with_vocabulary_list(
'Breed1', dataframe.Breed1.unique())
breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
demo(breed1_embedding)
[[-0.12131222 0.26008067 0.6732559 -0.47476637 0.06515788 -0.23891434 0.11922298 -0.2849519 ] [-0.46422648 -0.24685009 0.04639043 -0.2704345 0.15155672 0.34626234 -0.08356459 -0.04319363] [ 0.23924403 -0.526196 0.17385678 -0.10220847 -0.4520305 -0.37282032 0.39077568 -0.3101155 ] [-0.06447873 -0.05510841 0.4508166 -0.13070141 -0.04830438 -0.2310815 0.3791002 -0.42949012] [-0.06447873 -0.05510841 0.4508166 -0.13070141 -0.04830438 -0.2310815 0.3791002 -0.42949012]]
Hashed feature columns
Another way to represent a categorical column with a large number of values is to use a categorical_column_with_hash_bucket. This feature column calculates a hash value of the input, then selects one of the hash_bucket_size
buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash_buckets significantly smaller than the number of actual categories to save space.
breed1_hashed = feature_column.categorical_column_with_hash_bucket(
'Breed1', hash_bucket_size=10)
demo(feature_column.indicator_column(breed1_hashed))
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
Crossed feature columns
Combining features into a single feature, better known as feature crosses, enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of Age and Type. Note that crossed_column
does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a hashed_column
, so you can choose how large the table is.
crossed_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=10)
demo(feature_column.indicator_column(crossed_feature))
[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
Choose which columns to use
We have seen how to use several types of feature columns. Now we will use them to train a model. The goal of this tutorial is to show you the complete code (e.g. mechanics) needed to work with feature columns. We have selected a few columns to train our model below arbitrarily.
feature_columns = []
# numeric cols
for header in ['PhotoAmt', 'Fee', 'Age']:
feature_columns.append(feature_column.numeric_column(header))
# bucketized cols
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 2, 3, 4, 5])
feature_columns.append(age_buckets)
# indicator_columns
indicator_column_names = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
'FurLength', 'Vaccinated', 'Sterilized', 'Health']
for col_name in indicator_column_names:
categorical_column = feature_column.categorical_column_with_vocabulary_list(
col_name, dataframe[col_name].unique())
indicator_column = feature_column.indicator_column(categorical_column)
feature_columns.append(indicator_column)
# embedding columns
breed1 = feature_column.categorical_column_with_vocabulary_list(
'Breed1', dataframe.Breed1.unique())
breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
feature_columns.append(breed1_embedding)
# crossed columns
age_type_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=100)
feature_columns.append(feature_column.indicator_column(age_type_feature))
Create a feature layer
Now that we have defined our feature columns, we will use a DenseFeatures layer to input them to our Keras model.
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
Earlier, we used a small batch size to demonstrate how feature columns worked. We create a new input pipeline with a larger batch size.
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
Create, compile, and train the model
model = tf.keras.Sequential([
feature_layer,
layers.Dense(128, activation='relu'),
layers.Dense(128, activation='relu'),
layers.Dropout(.1),
layers.Dense(1)
])
model.compile(optimizer='adam',
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
model.fit(train_ds,
validation_data=val_ds,
epochs=10)
Epoch 1/10 WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'Type': <tf.Tensor 'IteratorGetNext:11' shape=(None,) dtype=string>, 'Age': <tf.Tensor 'IteratorGetNext:0' shape=(None,) dtype=int64>, 'Breed1': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=string>, 'Gender': <tf.Tensor 'IteratorGetNext:6' shape=(None,) dtype=string>, 'Color1': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'Color2': <tf.Tensor 'IteratorGetNext:3' shape=(None,) dtype=string>, 'MaturitySize': <tf.Tensor 'IteratorGetNext:8' shape=(None,) dtype=string>, 'FurLength': <tf.Tensor 'IteratorGetNext:5' shape=(None,) dtype=string>, 'Vaccinated': <tf.Tensor 'IteratorGetNext:12' shape=(None,) dtype=string>, 'Sterilized': <tf.Tensor 'IteratorGetNext:10' shape=(None,) dtype=string>, 'Health': <tf.Tensor 'IteratorGetNext:7' shape=(None,) dtype=string>, 'Fee': <tf.Tensor 'IteratorGetNext:4' shape=(None,) dtype=int64>, 'PhotoAmt': <tf.Tensor 'IteratorGetNext:9' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API. WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'Type': <tf.Tensor 'IteratorGetNext:11' shape=(None,) dtype=string>, 'Age': <tf.Tensor 'IteratorGetNext:0' shape=(None,) dtype=int64>, 'Breed1': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=string>, 'Gender': <tf.Tensor 'IteratorGetNext:6' shape=(None,) dtype=string>, 'Color1': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'Color2': <tf.Tensor 'IteratorGetNext:3' shape=(None,) dtype=string>, 'MaturitySize': <tf.Tensor 'IteratorGetNext:8' shape=(None,) dtype=string>, 'FurLength': <tf.Tensor 'IteratorGetNext:5' shape=(None,) dtype=string>, 'Vaccinated': <tf.Tensor 'IteratorGetNext:12' shape=(None,) dtype=string>, 'Sterilized': <tf.Tensor 'IteratorGetNext:10' shape=(None,) dtype=string>, 'Health': <tf.Tensor 'IteratorGetNext:7' shape=(None,) dtype=string>, 'Fee': <tf.Tensor 'IteratorGetNext:4' shape=(None,) dtype=int64>, 'PhotoAmt': <tf.Tensor 'IteratorGetNext:9' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API. 229/231 [============================>.] - ETA: 0s - loss: 0.6351 - accuracy: 0.6825WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'Type': <tf.Tensor 'IteratorGetNext:11' shape=(None,) dtype=string>, 'Age': <tf.Tensor 'IteratorGetNext:0' shape=(None,) dtype=int64>, 'Breed1': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=string>, 'Gender': <tf.Tensor 'IteratorGetNext:6' shape=(None,) dtype=string>, 'Color1': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'Color2': <tf.Tensor 'IteratorGetNext:3' shape=(None,) dtype=string>, 'MaturitySize': <tf.Tensor 'IteratorGetNext:8' shape=(None,) dtype=string>, 'FurLength': <tf.Tensor 'IteratorGetNext:5' shape=(None,) dtype=string>, 'Vaccinated': <tf.Tensor 'IteratorGetNext:12' shape=(None,) dtype=string>, 'Sterilized': <tf.Tensor 'IteratorGetNext:10' shape=(None,) dtype=string>, 'Health': <tf.Tensor 'IteratorGetNext:7' shape=(None,) dtype=string>, 'Fee': <tf.Tensor 'IteratorGetNext:4' shape=(None,) dtype=int64>, 'PhotoAmt': <tf.Tensor 'IteratorGetNext:9' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API. 231/231 [==============================] - 7s 18ms/step - loss: 0.6354 - accuracy: 0.6825 - val_loss: 0.5494 - val_accuracy: 0.7237 Epoch 2/10 231/231 [==============================] - 2s 9ms/step - loss: 0.5589 - accuracy: 0.7119 - val_loss: 0.5177 - val_accuracy: 0.6912 Epoch 3/10 231/231 [==============================] - 2s 8ms/step - loss: 0.5225 - accuracy: 0.7246 - val_loss: 0.5158 - val_accuracy: 0.7221 Epoch 4/10 231/231 [==============================] - 2s 7ms/step - loss: 0.4983 - accuracy: 0.7317 - val_loss: 0.5007 - val_accuracy: 0.7259 Epoch 5/10 231/231 [==============================] - 2s 8ms/step - loss: 0.4934 - accuracy: 0.7344 - val_loss: 0.5110 - val_accuracy: 0.7313 Epoch 6/10 231/231 [==============================] - 2s 8ms/step - loss: 0.4869 - accuracy: 0.7429 - val_loss: 0.5218 - val_accuracy: 0.7259 Epoch 7/10 231/231 [==============================] - 2s 8ms/step - loss: 0.4821 - accuracy: 0.7483 - val_loss: 0.4970 - val_accuracy: 0.7340 Epoch 8/10 231/231 [==============================] - 2s 7ms/step - loss: 0.4756 - accuracy: 0.7487 - val_loss: 0.4988 - val_accuracy: 0.7443 Epoch 9/10 231/231 [==============================] - 2s 7ms/step - loss: 0.4747 - accuracy: 0.7486 - val_loss: 0.4982 - val_accuracy: 0.7476 Epoch 10/10 231/231 [==============================] - 2s 8ms/step - loss: 0.4652 - accuracy: 0.7538 - val_loss: 0.5031 - val_accuracy: 0.7313 <keras.callbacks.History at 0x7f6822f917c0>
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
73/73 [==============================] - 0s 5ms/step - loss: 0.5045 - accuracy: 0.7153 Accuracy 0.7153379321098328
Next steps
The best way to learn more about classifying structured data is to try it yourself. We suggest finding another dataset to work with, and training a model to classify it using code similar to the above. To improve accuracy, think carefully about which features to include in your model, and how they should be represented.