## Introduction

Decision Forests (DF) are a large family of Machine Learning algorithms for supervised classification, regression and ranking. As the name suggests, DFs use decision trees as a building block. Today, the two most popular DF training algorithms are Random Forests and Gradient Boosted Decision Trees. Both are ensemble techniques that combine multiple decision trees, but they differ in how they build and combine them.
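To make the difference between the two ensemble styles concrete, here is a minimal sketch in plain Python. It is not how TF-DF implements either algorithm: the "trees" are hypothetical hand-written rules on a single bill-length value, and the thresholds and scores are made up for illustration. It only shows that a Random Forest combines independent trees by voting, while Gradient Boosted Trees add up the outputs of trees trained one after the other.

```python
from collections import Counter

# Three hypothetical, independently built "trees": each maps a bill length
# (in mm) to a class id. Thresholds are invented for illustration.
forest = [
    lambda x: 0 if x < 42.0 else 1,
    lambda x: 0 if x < 44.5 else 1,
    lambda x: 0 if x < 40.0 else 1,
]


def random_forest_predict(trees, x):
    """Each tree votes; the most common class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical boosted trees: each one outputs a correction that is added
# to the running score produced by the previous trees.
boosted_trees = [
    lambda x: -0.75 if x < 42.0 else 0.75,
    lambda x: -0.25 if x < 44.5 else 0.5,
]


def gradient_boosted_score(trees, x, initial_score=0.0):
    """The final score is the sum of all tree outputs."""
    return initial_score + sum(tree(x) for tree in trees)


print(random_forest_predict(forest, 43.0))
print(gradient_boosted_score(boosted_trees, 43.0))
```

In a real Gradient Boosted Trees classifier the summed score is further turned into class probabilities; that step is left out here.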

TensorFlow Decision Forests (TF-DF) is a library for the training, evaluation, interpretation and inference of Decision Forest models.

In this tutorial, you will learn how to:

- Train a multi-class classification Random Forest on a dataset containing numerical and categorical features, some with missing values.
- Evaluate the model on a test dataset.
- Prepare the model for TensorFlow Serving.
- Examine the overall structure of the model and the importance of each feature.
- Re-train the model with a different learning algorithm (Gradient Boosted Decision Trees).
- Use a different set of input features.
- Change the hyperparameters of the model.
- Preprocess the features.
- Train a model for regression.
- Train a model for ranking.

Detailed documentation is available in the user manual. The example directory contains other end-to-end examples.

## Installing TensorFlow Decision Forests

Install TF-DF by running the following cell.

`pip install tensorflow_decision_forests`

Wurlitzer is needed to display the detailed training logs in Colab (when using `verbose=2` in the model constructor).

`pip install wurlitzer`

## Importing libraries

```
import tensorflow_decision_forests as tfdf
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math
```

A hidden code cell (not shown here) defines the `%set_cell_height` magic used below to limit the output height in Colab.

```
# Check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)
```

## Training a Random Forest model

In this section, we train, evaluate, analyse and export a multi-class classification Random Forest trained on the Palmer's Penguins dataset.

### Load the dataset and convert it into a tf.data.Dataset

This dataset is very small (344 examples) and stored as a .csv-like file. Therefore, use Pandas to load it.

Let's download the csv file and load it:

```
# Download the dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv
# Load a dataset into a Pandas Dataframe.
dataset_df = pd.read_csv("/tmp/penguins.csv")
# Display the first 3 examples.
dataset_df.head(3)
```

The dataset contains a mix of numerical (e.g. `bill_depth_mm`) and categorical (e.g. `island`) features, some with missing values. TF-DF supports all of these natively (unlike neural-network-based models), so there is no need for preprocessing in the form of one-hot encoding, normalization or an extra `is_present` feature.

Labels are a bit different: Keras metrics expect integers. The label (`species`) is stored as a string, so let's convert it into an integer.

```
# Encode the categorical label into an integer.
#
# Details:
# This stage is necessary if your classification label is represented as a
# string. Note: Keras expects classification labels to be integers.

# Name of the label column.
label = "species"
classes = dataset_df[label].unique().tolist()
print(f"Label classes: {classes}")
dataset_df[label] = dataset_df[label].map(classes.index)
```

Label classes: ['Adelie', 'Gentoo', 'Chinstrap']
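Because the integer ids are simply positions in the `classes` list, the same list can be used to turn integer predictions back into species names. The sketch below uses the class order printed above; the `example_names` list is a made-up input for illustration.

```python
# Class order as printed above: the id of a class is its position in the list.
classes = ["Adelie", "Gentoo", "Chinstrap"]

# Hypothetical label values, encoded the same way as in the cell above.
example_names = ["Gentoo", "Adelie", "Chinstrap"]
encoded = [classes.index(name) for name in example_names]

# Decoding is a plain list lookup.
decoded = [classes[class_id] for class_id in encoded]

print(encoded)
print(decoded)
```

Keep the `classes` list next to the exported model if predictions need to be reported as names later on.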

Next, split the dataset into training and testing sets:

```
# Split the dataset into a training and a testing dataset.
def split_dataset(dataset, test_ratio=0.30):
    """Splits a pandas dataframe in two."""
    test_indices = np.random.rand(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]


train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))
```

236 examples in training, 108 examples for testing.
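The split above draws fresh random numbers on every run, so the exact counts and all downstream metrics vary between runs. If a reproducible split is preferred, one possible variant seeds a NumPy random generator. The function name `split_dataset_seeded`, the `seed` parameter and the toy dataframe are assumptions made for this sketch, not part of the tutorial.

```python
import numpy as np
import pandas as pd


def split_dataset_seeded(dataset, test_ratio=0.30, seed=1234):
    """Splits a pandas dataframe in two, reproducibly for a given seed."""
    rng = np.random.default_rng(seed)
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]


# Toy dataframe standing in for dataset_df.
toy_df = pd.DataFrame({"x": range(10)})

train_a, test_a = split_dataset_seeded(toy_df)
train_b, test_b = split_dataset_seeded(toy_df)

# Same seed, same split.
print(test_a.index.tolist() == test_b.index.tolist())
```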

And finally, convert the pandas dataframes (`pd.DataFrame`) into TensorFlow datasets (`tf.data.Dataset`):

```
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)
```

**Notes:** `pd_dataframe_to_tf_dataset` could have converted the label to an integer for you.

If you want to create the `tf.data.Dataset` yourself, there are a couple of things to remember:

- The learning algorithms work with a one-epoch dataset and without shuffling.
- The batch size does not impact the training algorithm, but a small value might slow down reading the dataset.
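As a rough illustration of these two points, the sketch below builds a `tf.data.Dataset` by hand from a tiny, made-up dataframe (no missing values, hypothetical rows). It makes a single pass over the data, applies no `shuffle()` or `repeat()`, and uses a large batch size. It is only one possible way to assemble such a dataset and is not meant to reproduce what `pd_dataframe_to_tf_dataset` does internally.

```python
import pandas as pd
import tensorflow as tf

# Tiny hypothetical frame standing in for train_ds_pd.
toy_df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.2],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "species": [0, 1, 2, 0],
})

# Features as a dictionary of column name -> array; the label is kept apart.
feature_columns = {
    name: toy_df[name].to_numpy() for name in toy_df.columns if name != "species"
}
labels = toy_df["species"].to_numpy()

# One epoch, no shuffling; a large batch size keeps dataset reading fast.
manual_ds = tf.data.Dataset.from_tensor_slices((feature_columns, labels)).batch(1000)

for batch_features, batch_labels in manual_ds:
    print(sorted(batch_features), batch_labels.numpy())
```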

### Train the model

```
%set_cell_height 300
# Specify the model.
model_1 = tfdf.keras.RandomForestModel()
# Train the model.
model_1.fit(x=train_ds)
```

### Remarks

- No input features are specified. Therefore, all the columns will be used as input features except for the label. The features used by the model are shown in the training logs and in `model.summary()`.
- DFs natively consume numerical, categorical and categorical-set features, as well as missing values. Numerical features do not need to be normalized. Categorical string values do not need to be encoded in a dictionary.
- No training hyper-parameters are specified. Therefore, the default hyper-parameters will be used. Default hyper-parameters provide reasonable results in most situations.
- Calling `compile` on the model before `fit` is optional. `compile` can be used to provide extra evaluation metrics.
- Training algorithms do not need validation datasets. If a validation dataset is provided, it will only be used to show metrics.
- Add a `verbose` argument to `RandomForestModel` to control the amount of displayed training logs. Set `verbose=0` to hide most of the logs. Set `verbose=2` to show all the logs.

## Evaluate the model

Let's evaluate our model on the test dataset.

```
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds, return_dict=True)
print()
for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")
```

1/1 [==============================] - 0s 292ms/step - loss: 0.0000e+00 - accuracy: 1.0000 loss: 0.0000 accuracy: 1.0000

**Remark:** The test accuracy is close to the out-of-bag accuracy shown in the training logs. Exact values change from run to run because the train/test split is random.

See the **Model Self Evaluation** section below for more evaluation methods.

## Prepare this model for TensorFlow Serving

Export the model to the SavedModel format for later re-use, e.g. with TensorFlow Serving.

```
model_1.save("/tmp/my_saved_model")
```

## Plot the model

Plotting a decision tree and following the first branches helps in learning about decision forests. In some cases, plotting a model can even be used for debugging.

Because of the difference in the way they are trained, some models are more interesting to plot than others. Because of the noise injected during training and the depth of the trees, plotting a Random Forest is less informative than plotting a CART or the first tree of a Gradient Boosted Tree.

Nevertheless, let's plot the first tree of our Random Forest model:

```
tfdf.model_plotter.plot_model_in_colab(model_1, tree_idx=0, max_depth=3)
```

The root node on the left contains the first condition (`bill_depth_mm >= 16.55`), the number of examples (240) and the label distribution (the red-blue-green bar).

Examples for which `bill_depth_mm >= 16.55` evaluates to true are branched to the green path. The other ones are branched to the red path.

The deeper the node, the more `pure` it becomes, i.e. the label distribution is biased toward a subset of classes.
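The routing and purity ideas can be sketched with a hand-written tree in plain Python. Only the root condition (`bill_depth_mm >= 16.55`) comes from the plot described above; the leaf distributions, the dictionary layout and the helper names `route` and `purity` are hypothetical, and missing values are not handled.

```python
# Hypothetical one-split tree. Only the root condition is taken from the
# plot; the leaf label distributions are invented for illustration.
tree = {
    "feature": "bill_depth_mm",
    "threshold": 16.55,
    "true": {"distribution": {"Adelie": 0.70, "Gentoo": 0.00, "Chinstrap": 0.30}},
    "false": {"distribution": {"Adelie": 0.05, "Gentoo": 0.90, "Chinstrap": 0.05}},
}


def route(node, example):
    """Follows the conditions from the root until a leaf is reached."""
    while "distribution" not in node:
        is_true = example[node["feature"]] >= node["threshold"]
        node = node["true"] if is_true else node["false"]
    return node["distribution"]


def purity(distribution):
    """Share of the dominant class: 1.0 means a perfectly pure node."""
    return max(distribution.values())


leaf = route(tree, {"bill_depth_mm": 14.2})
print(leaf, purity(leaf))
```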

## Model structure and feature importance

The overall structure of the model is shown with `.summary()`. You will see:

- **Type**: The learning algorithm used to train the model (`Random Forest` in our case).
- **Task**: The problem solved by the model (`Classification` in our case).
- **Input Features**: The input features of the model.
- **Variable Importance**: Different measures of the importance of each feature for the model.
- **Out-of-bag evaluation**: The out-of-bag evaluation of the model. This is a cheap and efficient alternative to cross-validation.
- **Number of {trees, nodes} and other metrics**: Statistics about the structure of the decision forest.

**Remark:** The summary's content depends on the learning algorithm (e.g.
Out-of-bag is only available for Random Forest) and the hyper-parameters (e.g.
the *mean-decrease-in-accuracy* variable importance can be disabled in the
hyper-parameters).

```
%set_cell_height 300
model_1.summary()
```

The information in `summary` is all available programmatically using the model inspector:

```
# The input features
model_1.make_inspector().features()
```

["bill_depth_mm" (1; #0), "bill_length_mm" (1; #1), "body_mass_g" (1; #2), "flipper_length_mm" (1; #3), "island" (4; #4), "sex" (4; #5), "year" (1; #6)]

```
# The feature importances
model_1.make_inspector().variable_importances()
```

{'NUM_AS_ROOT': [("flipper_length_mm" (1; #3), 120.0), ("bill_length_mm" (1; #1), 93.0), ("bill_depth_mm" (1; #0), 70.0), ("island" (4; #4), 16.0), ("body_mass_g" (1; #2), 1.0)], 'MEAN_MIN_DEPTH': [("__LABEL" (4; #7), 3.280960604210599), ("year" (1; #6), 3.2606045898545846), ("sex" (4; #5), 3.207423927923923), ("body_mass_g" (1; #2), 2.809716228216225), ("island" (4; #4), 2.1443225755725748), ("bill_depth_mm" (1; #0), 2.0560545380545374), ("flipper_length_mm" (1; #3), 1.6160606800606798), ("bill_length_mm" (1; #1), 1.2014205146705157)], 'SUM_SCORE': [("bill_length_mm" (1; #1), 26762.987352006137), ("flipper_length_mm" (1; #3), 18170.776540881954), ("bill_depth_mm" (1; #0), 11901.538405206986), ("island" (4; #4), 11507.26736571081), ("body_mass_g" (1; #2), 2007.0397834228352), ("sex" (4; #5), 385.18135726079345), ("year" (1; #6), 44.290259033441544)], 'NUM_NODES': [("bill_length_mm" (1; #1), 715.0), ("bill_depth_mm" (1; #0), 390.0), ("flipper_length_mm" (1; #3), 349.0), ("island" (4; #4), 307.0), ("body_mass_g" (1; #2), 280.0), ("sex" (4; #5), 49.0), ("year" (1; #6), 18.0)]}
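The raw dictionary is easier to read once one importance measure is normalized. The sketch below copies the `NUM_NODES` values printed above into plain `(name, value)` pairs (the real inspector returns feature objects rather than strings, so this stand-in is an assumption) and reports each feature's share of the nodes.

```python
# Values copied from the NUM_NODES entry printed above. Feature objects are
# replaced by plain names to keep the sketch self-contained.
num_nodes = [
    ("bill_length_mm", 715.0),
    ("bill_depth_mm", 390.0),
    ("flipper_length_mm", 349.0),
    ("island", 307.0),
    ("body_mass_g", 280.0),
    ("sex", 49.0),
    ("year", 18.0),
]

total = sum(value for _, value in num_nodes)
shares = {name: value / total for name, value in num_nodes}
top_feature = max(shares, key=shares.get)

# Print the features from the most to the least used.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:>18}: {share:.1%}")
```

The different measures do not always agree on the ranking, as the `NUM_AS_ROOT` and `NUM_NODES` entries above show, so it is worth looking at more than one.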

The content of the summary and the inspector depends on the learning algorithm (`tfdf.keras.RandomForestModel` in this case) and its hyper-parameters (e.g. `compute_oob_variable_importances=True` will trigger the computation of Out-of-bag variable importances for the Random Forest learner).

## Model Self Evaluation

During training, TF-DF models can self-evaluate even if no validation dataset is provided to the `fit()` method. The exact logic depends on the model. For example, Random Forest uses out-of-bag evaluation, while Gradient Boosted Trees use an internal train-validation split.

The model self evaluation is available with the inspector's `evaluation()`:

```
model_1.make_inspector().evaluation()
```

Evaluation(num_examples=236, accuracy=0.9745762711864406, loss=0.10867847165613735, rmse=None, ndcg=None, aucs=None)
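To give some intuition for why out-of-bag evaluation needs no separate validation set, the following sketch simulates the bootstrap sample of a single tree in plain Python. It is a conceptual illustration under the usual assumption that each tree is trained on examples drawn with replacement; it is not TF-DF's implementation, and the helper name `bootstrap_oob_indices` and the seed are made up.

```python
import random


def bootstrap_oob_indices(num_examples, rng):
    """Draws a bootstrap sample and returns (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(num_examples) for _ in range(num_examples)]
    out_of_bag = sorted(set(range(num_examples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(42)

# 236 is the number of training examples reported above.
in_bag, out_of_bag = bootstrap_oob_indices(236, rng)
oob_fraction = len(out_of_bag) / 236

# On average a bit more than a third of the examples are left out of each
# bootstrap sample, so every tree can be tested on examples it never saw.
print(f"Out-of-bag fraction for this tree: {oob_fraction:.2f}")
```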

## Plotting the training logs

The training logs show the quality of the model (e.g. accuracy evaluated on the out-of-bag or validation dataset) according to the number of trees in the model. These logs are helpful to study the balance between model size and model quality.

The logs are available in multiple ways:

1. Displayed during training if `fit()` is wrapped in `with sys_pipes():` (`sys_pipes` is provided by Wurlitzer).
2. At the end of the model summary, i.e. `model.summary()` (see example above).
3. Programmatically, using the model inspector, i.e. `model.make_inspector().training_logs()`.
4. Using TensorBoard.

Let's try options 2 and 3:

```
%set_cell_height 150
model_1.make_inspector().training_logs()
```

Let's plot it:

```
import matplotlib.pyplot as plt
logs = model_1.make_inspector().training_logs()
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")
plt.subplot(1, 2, 2)
plt.plot([log.num_trees for log in logs], [log.evaluation.loss for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Logloss (out-of-bag)")
plt.show()
```

This dataset is small. You can see the model converging almost immediately.

Let's use TensorBoard:

```
# This cell starts TensorBoard, which can be slow.
# Load the TensorBoard notebook extension
%load_ext tensorboard
```

`# Clear existing results (if any)`

`rm -fr "/tmp/tensorboard_logs"`

```
# Export the meta-data to tensorboard.
model_1.make_inspector().export_to_tensorboard("/tmp/tensorboard_logs")
```

```
# docs_infra: no_execute
# Start a tensorboard instance.
%tensorboard --logdir "/tmp/tensorboard_logs"
```

## Re-train the model with a different learning algorithm

The learning algorithm is defined by the model class. For example, `tfdf.keras.RandomForestModel()` trains a Random Forest, while `tfdf.keras.GradientBoostedTreesModel()` trains a Gradient Boosted Decision Trees model.

The learning algorithms are listed by calling `tfdf.keras.get_all_models()` or in the learner list.

```
tfdf.keras.get_all_models()
```

[tensorflow_decision_forests.keras.RandomForestModel, tensorflow_decision_forests.keras.GradientBoostedTreesModel, tensorflow_decision_forests.keras.CartModel, tensorflow_decision_forests.keras.DistributedGradientBoostedTreesModel]

The descriptions of the learning algorithms and their hyper-parameters are also available in the API reference and the built-in help:

```
# help works anywhere.
help(tfdf.keras.RandomForestModel)
# ? only works in ipython or notebooks, it usually opens on a separate panel.
tfdf.keras.RandomForestModel?
```

## Using a subset of features

The previous example did not specify the features, so all the columns were used as input features (except for the label). The following example shows how to specify input features.

```
feature_1 = tfdf.keras.FeatureUsage(name="bill_length_mm")
feature_2 = tfdf.keras.FeatureUsage(name="island")
all_features = [feature_1, feature_2]
# Note: This model is only trained with two features. It will not be as good as
# the one trained on all features.
model_2 = tfdf.keras.GradientBoostedTreesModel(
    features=all_features, exclude_non_specified_features=True)
model_2.compile(metrics=["accuracy"])
model_2.fit(x=train_ds, validation_data=test_ds)
print(model_2.evaluate(test_ds, return_dict=True))
```

Use /tmp/tmppf8e7g5_ as temporary training directory Starting reading the dataset 1/1 [==============================] - ETA: 0s Dataset read in 0:00:00.218573 Training model Model trained in 0:00:00.242015 Compiling model 1/1 [==============================] - 1s 608ms/step - val_loss: 0.0000e+00 - val_accuracy: 0.9722 [INFO kernel.cc:1153] Loading model from path [INFO kernel.cc:1001] Use fast generic engine 1/1 [==============================] - 0s 79ms/step - loss: 0.0000e+00 - accuracy: 0.9722 {'loss': 0.0, 'accuracy': 0.9722222089767456}

**TF-DF** attaches a **semantics** to each feature. This semantics controls how the feature is used by the model. The following semantics are currently supported:

- **Numerical**: Generally for quantities or counts with full ordering. For example, the age of a person, or the number of items in a bag. Can be a float or an integer. Missing values are represented with `float("NaN")` or with an empty sparse tensor.
- **Categorical**: Generally for a type/class in a finite set of possible values without ordering. For example, the color RED in the set {RED, BLUE, GREEN}. Can be a string or an integer. Missing values are represented as "" (empty string), value -2 or with an empty sparse tensor.
- **Categorical-Set**: A set of categorical values. Great to represent tokenized text. Can be a string or an integer in a sparse tensor or a ragged tensor (recommended). The order/index of each item doesn't matter.

If not specified, the semantics is inferred from the representation type and shown in the training logs:

- int, float (dense or sparse) → Numerical semantics.
- str (dense or sparse) → Categorical semantics
- int, str (ragged) → Categorical-Set semantics
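The three rules above can be mirrored in a few lines of plain Python. This is only an illustration of the mapping: the helper `infer_semantic` is hypothetical, Python lists stand in for ragged tensors, and the actual inference in TF-DF works on tensor types rather than on Python values.

```python
def infer_semantic(value):
    """Illustrative mapping from a Python value to a feature semantics."""
    if isinstance(value, (list, tuple, set)):
        # A variable-length collection stands in for a ragged tensor.
        return "CATEGORICAL_SET"
    if isinstance(value, str):
        return "CATEGORICAL"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "NUMERICAL"
    raise TypeError(f"Unsupported value type: {type(value).__name__}")


print(infer_semantic(39.1))               # a measurement
print(infer_semantic("Biscoe"))           # a string category
print(infer_semantic(["hello", "world"]))  # tokenized text
print(infer_semantic(2007))               # an integer, even if it encodes a category
```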

In some cases, the inferred semantics is incorrect. For example, an Enum stored as an integer is semantically categorical, but it will be detected as numerical. In this case, you should specify the `semantic` argument of the input feature. The `education_num` field of the Adult dataset is a classic example.

This dataset doesn't contain such a feature. However, for the demonstration, we will make the model treat `year` as a categorical feature:

```
%set_cell_height 300
feature_1 = tfdf.keras.FeatureUsage(name="year", semantic=tfdf.keras.FeatureSemantic.CATEGORICAL)
feature_2 = tfdf.keras.FeatureUsage(name="bill_length_mm")
feature_3 = tfdf.keras.FeatureUsage(name="sex")
all_features = [feature_1, feature_2, feature_3]
model_3 = tfdf.keras.GradientBoostedTreesModel(features=all_features, exclude_non_specified_features=True)
model_3.compile(metrics=["accuracy"])
model_3.fit(x=train_ds, validation_data=test_ds)
```

<IPython.core.display.Javascript object> Use /tmp/tmpihvn_e8p as temporary training directory Starting reading the dataset 1/1 [==============================] - ETA: 0s Dataset read in 0:00:00.154245 Training model Model trained in 0:00:00.135197 Compiling model 1/1 [==============================] - 0s 437ms/step - val_loss: 0.0000e+00 - val_accuracy: 0.8148 [INFO kernel.cc:1153] Loading model from path [INFO kernel.cc:1001] Use fast generic engine <keras.callbacks.History at 0x7f64d02b6810>

Note that `year` is in the list of CATEGORICAL features (unlike the first run).

## Hyper-parameters

**Hyper-parameters** are parameters of the training algorithm that impact the quality of the final model. They are specified in the model class constructor. The list of hyper-parameters is visible with the *question mark* Colab command (e.g. `?tfdf.keras.GradientBoostedTreesModel`).

Alternatively, you can find them on the TensorFlow Decision Forests GitHub or in the Yggdrasil Decision Forests documentation.

The default hyper-parameters of each algorithm approximately match those of the initial publication. To ensure consistency, new features and their matching hyper-parameters are always disabled by default. That's why it is a good idea to tune your hyper-parameters.

```
# A classical but slightly more complex model.
model_6 = tfdf.keras.GradientBoostedTreesModel(
    num_trees=500, growing_strategy="BEST_FIRST_GLOBAL", max_depth=8)
model_6.fit(x=train_ds)
```

Use /tmp/tmpj23_ibou as temporary training directory Starting reading the dataset 1/1 [==============================] - ETA: 0s Dataset read in 0:00:00.097619 Training model Model trained in 0:00:00.228236 Compiling model 1/1 [==============================] - 0s 337ms/step [INFO kernel.cc:1153] Loading model from path [INFO kernel.cc:1001] Use fast generic engine <keras.callbacks.History at 0x7f64d0034490>

```
# A more complex, but possibly more accurate, model.
model_7 = tfdf.keras.GradientBoostedTreesModel(
    num_trees=500,
    growing_strategy="BEST_FIRST_GLOBAL",
    max_depth=8,
    split_axis="SPARSE_OBLIQUE",
    categorical_algorithm="RANDOM",
)
model_7.fit(x=train_ds)
```

Use /tmp/tmpyh5caajh as temporary training directory Starting reading the dataset WARNING:tensorflow:5 out of the last 5 calls to <function Model.make_train_function.<locals>.train_function at 0x7f60f0e37cb0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. WARNING:tensorflow:5 out of the last 5 calls to <function Model.make_train_function.<locals>.train_function at 0x7f60f0e37cb0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. 
1/1 [==============================] - ETA: 0s Dataset read in 0:00:00.103518 Training model Model trained in 0:00:00.186311 Compiling model 1/1 [==============================] - 0s 302ms/step [INFO kernel.cc:1153] Loading model from path [INFO kernel.cc:1001] Use fast generic engine WARNING:tensorflow:5 out of the last 5 calls to <function CoreModel.make_predict_function.<locals>.predict_function_trained at 0x7f60f0df7200> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. WARNING:tensorflow:5 out of the last 5 calls to <function CoreModel.make_predict_function.<locals>.predict_function_trained at 0x7f60f0df7200> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. <keras.callbacks.History at 0x7f60f0e4cad0>

As new training methods are published and implemented, combinations of hyper-parameters can emerge as good or almost-always-better than the default parameters. To avoid changing the default hyper-parameter values, these good combinations are indexed and available as hyper-parameter templates.

For example, the `benchmark_rank1` template is the best combination on our internal benchmarks. Those templates are versioned to allow training configuration stability, e.g. `benchmark_rank1@v1`.

```
# A good template of hyper-parameters.
model_8 = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1")
model_8.fit(x=train_ds)
```

Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}. Use /tmp/tmppji8zs02 as temporary training directory Starting reading the dataset WARNING:tensorflow:6 out of the last 6 calls to <function Model.make_train_function.<locals>.train_function at 0x7f60f0dcfcb0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. WARNING:tensorflow:6 out of the last 6 calls to <function Model.make_train_function.<locals>.train_function at 0x7f60f0dcfcb0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. 
1/1 [==============================] - ETA: 0s Dataset read in 0:00:00.099185 Training model Model trained in 0:00:00.064944 Compiling model [INFO kernel.cc:1153] Loading model from path 1/1 [==============================] - 0s 176ms/step WARNING:tensorflow:6 out of the last 6 calls to <function CoreModel.make_predict_function.<locals>.predict_function_trained at 0x7f60f0de7a70> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. [INFO kernel.cc:1001] Use fast generic engine WARNING:tensorflow:6 out of the last 6 calls to <function CoreModel.make_predict_function.<locals>.predict_function_trained at 0x7f60f0de7a70> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. <keras.callbacks.History at 0x7f60f0de12d0>

The available templates are listed by `predefined_hyperparameters`. Note that different learning algorithms have different templates, even if the name is similar.

```
# The hyper-parameter templates of the Gradient Boosted Tree model.
print(tfdf.keras.GradientBoostedTreesModel.predefined_hyperparameters())
```

[HyperParameterTemplate(name='better_default', version=1, parameters={'growing_strategy': 'BEST_FIRST_GLOBAL'}, description='A configuration that is generally better than the default parameters without being more expensive.'), HyperParameterTemplate(name='benchmark_rank1', version=1, parameters={'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}, description='Top ranking hyper-parameters on our benchmark slightly modified to run in reasonable time.')]

## Feature Preprocessing

Pre-processing features is sometimes necessary to consume signals with complex structures, to regularize the model or to apply transfer learning. Pre-processing can be done in one of three ways:

1. **Preprocessing on the Pandas dataframe**: This solution is easy to implement and generally suitable for experimentation. However, the pre-processing logic will not be exported in the model by `model.save()`.
2. **Keras Preprocessing**: While more complex than the previous solution, Keras Preprocessing is packaged in the model.
3. **TensorFlow Feature Columns**: This API is part of the TF Estimator library (!= Keras) and planned for deprecation. This solution is interesting when using existing preprocessing code.
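The first option, preprocessing directly on the Pandas dataframe, could look like the sketch below. The toy dataframe and the helper name `preprocess` are assumptions made for illustration. Since this logic is not exported by `model.save()`, the same function has to be applied to any data sent to the model later on.

```python
import pandas as pd

# Toy dataframe standing in for train_ds_pd; the None becomes a missing value.
toy_df = pd.DataFrame({
    "body_mass_g": [3750.0, None, 5700.0],
    "bill_length_mm": [39.1, 40.3, 50.0],
})


def preprocess(df):
    """Returns a copy of df with body_mass_g converted to kilograms."""
    df = df.copy()
    df["body_mass_kg"] = df["body_mass_g"] / 1000.0
    return df.drop(columns=["body_mass_g"])


processed = preprocess(toy_df)

# Missing values stay missing, which decision forests can consume as-is.
print(processed)
```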

In the next example, pre-process the `body_mass_g` feature into `body_mass_kg = body_mass_g / 1000`. The `bill_length_mm` feature is consumed without pre-processing. Note that such monotonic transformations generally have no impact on decision forest models.

```
%set_cell_height 300
body_mass_g = tf.keras.layers.Input(shape=(1,), name="body_mass_g")
body_mass_kg = body_mass_g / 1000.0
bill_length_mm = tf.keras.layers.Input(shape=(1,), name="bill_length_mm")
raw_inputs = {"body_mass_g": body_mass_g, "bill_length_mm": bill_length_mm}
processed_inputs = {"body_mass_kg": body_mass_kg, "bill_length_mm": bill_length_mm}
# "preprocessor" contains the preprocessing logic.
preprocessor = tf.keras.Model(inputs=raw_inputs, outputs=processed_inputs)
# "model_4" contains both the pre-processing logic and the decision forest.
model_4 = tfdf.keras.RandomForestModel(preprocessing=preprocessor)
model_4.fit(x=train_ds)
model_4.summary()
```

Model: "random_forest_model_1"
_________________________________________________________________
Layer (type)                Output Shape              Param #   
=================================================================
model (Functional)          {'body_mass_kg': (None,   0         
                            1), 'bill_length_mm': (None
                            , 1)}                               
=================================================================
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "RANDOM_FOREST"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (2):
        bill_length_mm
        body_mass_kg

No weights

Variable Importance: MEAN_MIN_DEPTH:
    1.        "__LABEL" 4.001979 ################
    2.   "body_mass_kg" 1.228907 ####
    3. "bill_length_mm" 0.011190

Variable Importance: NUM_AS_ROOT:
    1.       "bill_length_mm" 297.000000 ################
    2. "body_mass_kg" 3.000000

Variable Importance: NUM_NODES:
    1.       "bill_length_mm" 1658.000000 ################
    2. "body_mass_kg" 1425.000000

Variable Importance: SUM_SCORE:
    1.       "bill_length_mm" 43748.331708 ################
    2.
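The earlier note that monotonic transformations generally have no impact on decision forests can be checked on a toy example. Tree conditions of the form `feature >= threshold` only depend on the ordering of the values, so dividing by 1000 changes the thresholds but not the groups of examples they separate. The sketch below enumerates every candidate split on a few made-up body masses; the helper `candidate_partitions` is hypothetical and ignores how a real learner scores and selects splits.

```python
def candidate_partitions(values):
    """For each midpoint threshold, the set of example indices below it."""
    distinct = sorted(set(values))
    thresholds = [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]
    return [
        frozenset(i for i, value in enumerate(values) if value < threshold)
        for threshold in thresholds
    ]


# Made-up body masses in grams, and the same values in kilograms.
body_mass_g = [3750.0, 3800.0, 3250.0, 5700.0, 4675.0]
body_mass_kg = [grams / 1000.0 for grams in body_mass_g]

same_partitions = candidate_partitions(body_mass_g) == candidate_partitions(body_mass_kg)
print(same_partitions)
```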

The following example re-implements the same logic using TensorFlow Feature Columns.

```
def g_to_kg(x):
    return x / 1000


feature_columns = [
    tf.feature_column.numeric_column("body_mass_g", normalizer_fn=g_to_kg),
    tf.feature_column.numeric_column("bill_length_mm"),
]
preprocessing = tf.keras.layers.DenseFeatures(feature_columns)
model_5 = tfdf.keras.RandomForestModel(preprocessing=preprocessing)
model_5.fit(x=train_ds)
```

Use /tmp/tmp7zcurxdh as temporary training directory Starting reading the dataset 1/1 [==============================] - ETA: 0s Dataset read in 0:00:00.094296 Training model Model trained in 0:00:00.021447 Compiling model 1/1 [==============================] - 0s 135ms/step [INFO kernel.cc:1153] Loading model from path [INFO kernel.cc:1001] Use fast generic engine <keras.callbacks.History at 0x7f60f25b2790>

## Training a regression model

The previous example trains a classification model (TF-DF does not differentiate between binary classification and multi-class classification). In the next example, train a regression model on the Abalone dataset. The objective is to predict the number of rings on an abalone's shell.

```
# Download the dataset.
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/abalone_raw.csv -O /tmp/abalone.csv
dataset_df = pd.read_csv("/tmp/abalone.csv")
print(dataset_df.head(3))
```

| | Type | LongestShell | Diameter | Height | WholeWeight | ShuckedWeight | VisceraWeight | ShellWeight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.15 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.21 | 9 |

```
# Split the dataset into a training and testing dataset.
train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))

# Name of the label column.
label = "Rings"

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
# The test dataset is built from the held-out split.
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
```

2901 examples in training, 1276 examples for testing. /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_decision_forests/keras/core.py:2036: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only features_dataframe = dataframe.drop(label, 1)

```
%set_cell_height 300
# Configure the model.
model_7 = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
# Train the model.
model_7.fit(x=train_ds)
```

<IPython.core.display.Javascript object> Use /tmp/tmpmj202ct3 as temporary training directory Starting reading the dataset 1/3 [=========>....................] - ETA: 0s Dataset read in 0:00:00.121706 Training model Model trained in 0:00:00.792651 Compiling model [INFO kernel.cc:1153] Loading model from path 3/3 [==============================] - 2s 755ms/step [INFO kernel.cc:1001] Use fast generic engine <keras.callbacks.History at 0x7f65ecc18c90>

```
# Evaluate the model on the test dataset.
model_7.compile(metrics=["mse"])
evaluation = model_7.evaluate(test_ds, return_dict=True)
print(evaluation)
print()
print(f"MSE: {evaluation['mse']}")
print(f"RMSE: {math.sqrt(evaluation['mse'])}")
```

WARNING:tensorflow:5 out of the last 5 calls to <function CoreModel.make_test_function.<locals>.test_function at 0x7f60f240a560> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. 3/3 [==============================] - 0s 34ms/step - loss: 0.0000e+00 - mse: 1.9625 {'loss': 0.0, 'mse': 1.9624943733215332} MSE: 1.9624943733215332 RMSE: 1.4008905643630887
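To make the relation between the two reported numbers concrete, here is a minimal sketch of how MSE and RMSE are derived from predictions. The ring counts and predictions below are made-up values for illustration, not outputs of `model_7`.

```python
import math

import numpy as np

# Hypothetical ring counts and predictions, for illustration only.
y_true = np.array([15.0, 7.0, 9.0, 10.0])
y_pred = np.array([13.0, 8.0, 9.0, 12.0])

# MSE: mean of the squared errors.
mse = float(np.mean((y_true - y_pred) ** 2))
# RMSE: square root of the MSE, expressed in the same unit as the label (rings).
rmse = math.sqrt(mse)

print(f"MSE: {mse}")    # MSE: 2.25
print(f"RMSE: {rmse}")  # RMSE: 1.5
```

Because the RMSE is in the unit of the label, it is usually the easier of the two to interpret: here, predictions are off by about 1.5 rings.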

## Training a ranking model

Finally, after training a classification model and a regression model, train a ranking model.

The goal of ranking is to **order** items by importance. The "value" of relevance does not matter directly. Ranking a set of *documents* with regard to a user *query* is an example of a ranking problem: only the order matters, and the order of the top documents matters most.

TF-DF expects ranking datasets to be presented in a "flat" format. A document+query dataset might look like this:

query | document_id | feature_1 | feature_2 | relevance/label |
---|---|---|---|---|
cat | 1 | 0.1 | blue | 4 |
cat | 2 | 0.5 | green | 1 |
cat | 3 | 0.2 | red | 2 |
dog | 4 | NA | red | 0 |
dog | 5 | 0.2 | red | 1 |
dog | 6 | 0.6 | green | 1 |

The *relevance/label* is a floating point numerical value between 0 and 5
(generally between 0 and 4) where 0 means "completely unrelated", 4 means "very
relevant" and 5 means "the same as the query".
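As a rough illustration of the "flat" format, the toy table above can be written as a pandas DataFrame with one row per (query, document) pair. The column name `relevance` and the variable names are choices made for this sketch. The ordering computed below only shows what the labels imply and is not something TF-DF requires you to compute.

```python
import numpy as np
import pandas as pd

# The toy table as a flat DataFrame: one row per (query, document) pair.
flat_df = pd.DataFrame({
    "query": ["cat", "cat", "cat", "dog", "dog", "dog"],
    "document_id": [1, 2, 3, 4, 5, 6],
    "feature_1": [0.1, 0.5, 0.2, np.nan, 0.2, 0.6],  # NA is a missing value.
    "feature_2": ["blue", "green", "red", "red", "red", "green"],
    "relevance": [4, 1, 2, 0, 1, 1],
})

# The ground-truth ordering ranks documents by decreasing relevance
# within each query; documents of different queries are never compared.
ideal_order = (
    flat_df.sort_values(["query", "relevance"], ascending=[True, False])
    .groupby("query")["document_id"]
    .apply(list)
    .to_dict()
)
print(ideal_order["cat"])  # [1, 3, 2]
```

In this layout the `query` column is the grouping column: it plays the role of the column passed as `ranking_group` when the ranking model is configured later in this section.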

Interestingly, decision forests are often good rankers, and many state-of-the-art ranking models are decision forests.

In this example, use a sample of the LETOR3 dataset. More precisely, we want to download the `OHSUMED.zip` from the LETOR3 repo. This dataset is stored in the libsvm format, so we will need to convert it to csv.

```
%set_cell_height 200
archive_path = tf.keras.utils.get_file("letor.zip",
    "https://download.microsoft.com/download/E/7/E/E7EABEF1-4C7B-4E31-ACE5-73927950ED5E/Letor.zip",
    extract=True)

# Path to the train and test dataset using libsvm format.
raw_dataset_path = os.path.join(os.path.dirname(archive_path),"OHSUMED/Data/All/OHSUMED.txt")
```

<IPython.core.display.Javascript object> Downloading data from https://download.microsoft.com/download/E/7/E/E7EABEF1-4C7B-4E31-ACE5-73927950ED5E/Letor.zip 61825024/61824018 [==============================] - 6s 0us/step 61833216/61824018 [==============================] - 6s 0us/step

The dataset is stored as a .txt file in a specific format, so first convert it into a csv file.

```
def convert_libsvm_to_csv(src_path, dst_path):
    """Converts a libsvm ranking dataset into a flat csv file.

    Note: This code is specific to the LETOR3 dataset.
    """
    with open(src_path, "r") as src_handle, open(dst_path, "w") as dst_handle:
        first_line = True
        for src_line in src_handle:
            # Note: The last 3 items are comments.
            items = src_line.split(" ")[:-3]
            relevance = items[0]
            group = items[1].split(":")[1]
            features = [item.split(":") for item in items[2:]]

            if first_line:
                # Csv header
                dst_handle.write("relevance,group," + ",".join(["f_" + feature[0] for feature in features]) + "\n")
                first_line = False
            dst_handle.write(relevance + ",g_" + group + "," + (",".join([feature[1] for feature in features])) + "\n")

# Convert the dataset.
csv_dataset_path = "/tmp/ohsumed.csv"
convert_libsvm_to_csv(raw_dataset_path, csv_dataset_path)

# Load a dataset into a Pandas Dataframe.
dataset_df = pd.read_csv(csv_dataset_path)

# Display the first 3 examples.
dataset_df.head(3)
```
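To see what the conversion does to a single record, the sketch below applies the same splitting logic to one line. The sample line is hypothetical: it assumes the layout `<relevance> qid:<group> <index>:<value> ... #docid = <id>` and is shortened to three features, so it is not copied from the real file.

```python
# A hypothetical line in the assumed LETOR3 libsvm layout (3 features only).
sample_line = "2 qid:1 1:0.5 2:0.25 3:1.0 #docid = 40626\n"

# Drop the trailing comment, which splits into 3 items: "#docid", "=", "<id>".
items = sample_line.split(" ")[:-3]
relevance = items[0]
group = items[1].split(":")[1]
features = [item.split(":") for item in items[2:]]

# Same header and row layout as the csv written by convert_libsvm_to_csv.
header = "relevance,group," + ",".join("f_" + name for name, _ in features)
row = relevance + ",g_" + group + "," + ",".join(value for _, value in features)

print(header)  # relevance,group,f_1,f_2,f_3
print(row)     # 2,g_1,0.5,0.25,1.0
```

The `g_` prefix keeps the group identifier from being parsed as a number when the csv file is loaded, so it is treated as a categorical value.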

```
train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))

# Display the first 3 examples of the training dataset.
train_ds_pd.head(3)
```

11318 examples in training, 4822 examples for testing.

In this dataset, the `relevance` defines the ground-truth rank among rows of the same `group`.

```
# Name of the relevance column.
relevance = "relevance"

ranking_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=relevance, task=tfdf.keras.Task.RANKING)
# The test dataset is built from the held-out split.
ranking_test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=relevance, task=tfdf.keras.Task.RANKING)
```

/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_decision_forests/keras/core.py:2036: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only features_dataframe = dataframe.drop(label, 1)

```
%set_cell_height 400
model_8 = tfdf.keras.GradientBoostedTreesModel(
    task=tfdf.keras.Task.RANKING,
    ranking_group="group",
    num_trees=50)

model_8.fit(x=ranking_train_ds)
```

<IPython.core.display.Javascript object> Use /tmp/tmplxrwn2da as temporary training directory Starting reading the dataset 9/12 [=====================>........] - ETA: 0s Dataset read in 0:00:00.567589 Training model Model trained in 0:00:01.289335 Compiling model 12/12 [==============================] - 2s 131ms/step [INFO kernel.cc:1153] Loading model from path [INFO abstract_model.cc:1063] Engine "GradientBoostedTreesQuickScorerExtended" built [INFO kernel.cc:1001] Use fast generic engine <keras.callbacks.History at 0x7f60f2392510>

At this point, Keras does not provide any ranking metrics. Instead, the training and validation losses (a GBDT uses a validation dataset) are shown in the training logs. In this case the loss is `LAMBDA_MART_NDCG5`, and the final (i.e. at the end of the training) NDCG (normalized discounted cumulative gain) is `0.510136` (see line `Final model valid-loss: -0.510136`).

Note that the NDCG is a value between 0 and 1. The larger the NDCG, the better the model. For this reason, the loss is defined as -NDCG.
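For intuition, the sketch below computes NDCG truncated at 5 for a single group with NumPy. It assumes the common formulation with gain `2^relevance - 1` and a `log2(position + 1)` discount. The exact gain, discount and tie handling used internally by TF-DF may differ, and the helper names `dcg_at_k` and `ndcg_at_k` are introduced only for this example.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of relevances listed in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    gains = 2.0 ** rel - 1.0
    # Positions 1..n are discounted by log2(position + 1).
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(gains / discounts))

def ndcg_at_k(relevances, k=5):
    """DCG divided by the DCG of the ideal (decreasing relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Ground-truth relevances of one group, in the order predicted by a
# hypothetical model: the most relevant document was ranked second.
predicted_order = [1, 4, 2]

print(ndcg_at_k([4, 2, 1]))         # Perfect ordering: 1.0
print(ndcg_at_k(predicted_order))   # Imperfect ordering: below 1
print(-ndcg_at_k(predicted_order))  # The same value written as a loss
```

Swapping documents near the top of the list changes the score more than swapping documents near the bottom, which is how the metric encodes that the top documents matter most.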

As before, the model can be analysed:

```
%set_cell_height 400
model_8.summary()
```

