Welcome to the Intermediate Colab for TensorFlow Decision Forests (TF-DF). In this colab, you will learn about some more advanced capabilities of TF-DF, including how to deal with natural language features.
This colab assumes you are familiar with the concepts presented in the Beginner colab, notably the installation of TF-DF.
In this colab, you will:
- Train a Random Forest that consumes text features natively as categorical sets.
- Train a Random Forest that consumes text features using a TensorFlow Hub module. In this setting (transfer learning), the module is already pre-trained on a large text corpus.
- Train a Random Forest and a Neural Network together. The Random Forest will consume the output of the Neural Network.
Setup
# Install TensorFlow Decision Forests
pip install tensorflow_decision_forests
Wurlitzer is needed to display the detailed training logs in Colabs (when using verbose=2 in the model constructor).
pip install wurlitzer
Import the necessary libraries.
import tensorflow_decision_forests as tfdf
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math
The hidden code cell limits the output height in colab.
from IPython.core.magic import register_line_magic
from IPython.display import Javascript
from IPython.display import display as ipy_display
# Some of the model training logs can cover the full
# screen if not compressed to a smaller viewport.
# This magic allows setting a max height for a cell.
@register_line_magic
def set_cell_height(size):
  ipy_display(
      Javascript("google.colab.output.setIframeHeight(0, true, {maxHeight: " +
                 str(size) + "})"))
Use raw text as features
TF-DF can consume categorical-set features natively. Categorical-sets represent text features as bags of words (or n-grams).
For example: "The little blue dog"
→ {"the", "little", "blue", "dog"}
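As a rough illustration in plain Python (not TF-DF's internal representation), a bag of words can be sketched as the set of lowercased whitespace tokens of a sentence. The helper name `to_bag_of_words` is hypothetical.

```python
def to_bag_of_words(sentence):
    """Lowercase the sentence, split on whitespace, and drop duplicates."""
    return set(sentence.lower().split())


bag = to_bag_of_words("The little blue dog")
print(sorted(bag))  # ['blue', 'dog', 'little', 'the']
```

Word order and repetitions are discarded: only the presence of each word is kept.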
In this example, you will train a Random Forest on the Stanford Sentiment Treebank (SST) dataset. The objective of this dataset is to classify sentences as carrying a positive or negative sentiment. You will use the binary classification version of the dataset curated in TensorFlow Datasets.
# Install the TensorFlow Datasets package
pip install tensorflow-datasets -U --quiet
# Load the dataset
import tensorflow_datasets as tfds
all_ds = tfds.load("glue/sst2")
# Display the first 3 examples of the test fold.
for example in all_ds["test"].take(3):
  print({attr_name: attr_tensor.numpy() for attr_name, attr_tensor in example.items()})
{'idx': 163, 'label': -1, 'sentence': b'not even the hanson brothers can save it'}
{'idx': 131, 'label': -1, 'sentence': b'strong setup and ambitious goals fade as the film descends into unsophisticated scare tactics and b-film thuggery .'}
{'idx': 1579, 'label': -1, 'sentence': b'too timid to bring a sense of closure to an ugly chapter of the twentieth century .'}
2023-11-20 12:31:28.022927: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
The dataset is modified as follows:
- The raw labels are integers in {-1, 1}, but the learning algorithm expects non-negative integer labels, e.g. {0, 1}. Therefore, the labels are transformed as follows: new_labels = (original_labels + 1) // 2.
- A batch size of 100 is applied to make reading the dataset more efficient.
- The sentence attribute needs to be tokenized, i.e. "hello world" -> ["hello", "world"].
Details: Some decision forest learning algorithms do not need a validation dataset (e.g. Random Forests) while others do (e.g. Gradient Boosted Trees in some cases). Since each learning algorithm under TF-DF can use validation data differently, TF-DF handles train/validation splits internally. As a result, when you have both a training set and a validation set, they can always be concatenated as input to the learning algorithm.
def prepare_dataset(example):
  label = (example["label"] + 1) // 2
  return {"sentence" : tf.strings.split(example["sentence"])}, label
train_ds = all_ds["train"].batch(100).map(prepare_dataset)
test_ds = all_ds["validation"].batch(100).map(prepare_dataset)
Finally, train and evaluate the model as usual. TF-DF automatically detects multi-valued categorical features as categorical-set.
%set_cell_height 300
# Specify the model.
model_1 = tfdf.keras.RandomForestModel(num_trees=30, verbose=2)
# Train the model.
model_1.fit(x=train_ds)
<IPython.core.display.Javascript object> Warning: The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus. WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus. Use /tmpfs/tmp/tmpp9alip3z as temporary training directory Reading training dataset... Training tensor examples: Features: {'sentence': tf.RaggedTensor(values=Tensor("data:0", shape=(None,), dtype=string), row_splits=Tensor("data_1:0", shape=(None,), dtype=int64))} Label: Tensor("data_2:0", shape=(None,), dtype=int64) Weights: None Normalized tensor features: {'sentence': SemanticTensor(semantic=<Semantic.CATEGORICAL_SET: 4>, tensor=tf.RaggedTensor(values=Tensor("data:0", shape=(None,), dtype=string), row_splits=Tensor("data_1:0", shape=(None,), dtype=int64)))} Training dataset read in 0:00:04.588912. Found 67349 examples. Training model... Standard output detected as not visible to the user e.g. running in a notebook. Creating a training log redirection. If training gets stuck, try calling tfdf.keras.set_training_logs_redirection(False). 
[INFO 23-11-20 12:31:32.7845 UTC kernel.cc:771] Start Yggdrasil model training [INFO 23-11-20 12:31:32.7845 UTC kernel.cc:772] Collect training examples [INFO 23-11-20 12:31:32.7846 UTC kernel.cc:785] Dataspec guide: column_guides { column_name_pattern: "^__LABEL$" type: CATEGORICAL categorial { min_vocab_frequency: 0 max_vocab_count: -1 } } default_column_guide { categorial { max_vocab_count: 2000 } discretized_numerical { maximum_num_bins: 255 } } ignore_columns_without_guides: false detect_numerical_as_discretized_numerical: false [INFO 23-11-20 12:31:32.7849 UTC kernel.cc:391] Number of batches: 674 [INFO 23-11-20 12:31:32.7849 UTC kernel.cc:392] Number of examples: 67349 [INFO 23-11-20 12:31:32.8290 UTC data_spec_inference.cc:305] 12816 item(s) have been pruned (i.e. they are considered out of dictionary) for the column sentence (2000 item(s) left) because min_value_count=5 and max_number_of_unique_values=2000 [INFO 23-11-20 12:31:32.8820 UTC kernel.cc:792] Training dataset: Number of records: 67349 Number of columns: 2 Number of columns by type: CATEGORICAL_SET: 1 (50%) CATEGORICAL: 1 (50%) Columns: CATEGORICAL_SET: 1 (50%) 1: "sentence" CATEGORICAL_SET has-dict vocab-size:2001 num-oods:10187 (15.1257%) most-frequent:"the" 27205 (40.3941%) CATEGORICAL: 1 (50%) 0: "__LABEL" CATEGORICAL integerized vocab-size:3 no-ood-item Terminology: nas: Number of non-available (i.e. missing) values. ood: Out of dictionary. manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred. tokenized: The attribute value is obtained through tokenization. has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string. vocab-size: Number of unique values. 
[INFO 23-11-20 12:31:32.8821 UTC kernel.cc:808] Configure learner [INFO 23-11-20 12:31:32.8823 UTC kernel.cc:822] Training config: learner: "RANDOM_FOREST" features: "^sentence$" label: "^__LABEL$" task: CLASSIFICATION random_seed: 123456 metadata { framework: "TF Keras" } pure_serving_model: false [yggdrasil_decision_forests.model.random_forest.proto.random_forest_config] { num_trees: 30 decision_tree { max_depth: 16 min_examples: 5 in_split_min_examples_check: true keep_non_leaf_label_distribution: true num_candidate_attributes: 0 missing_value_policy: GLOBAL_IMPUTATION allow_na_conditions: false categorical_set_greedy_forward { sampling: 0.1 max_num_items: -1 min_item_frequency: 1 } growing_strategy_local { } categorical { cart { } } axis_aligned_split { } internal { sorting_strategy: PRESORTED } uplift { min_examples_in_treatment: 5 split_score: KULLBACK_LEIBLER } } winner_take_all_inference: true compute_oob_performances: true compute_oob_variable_importances: false num_oob_variable_importances_permutations: 1 bootstrap_training_dataset: true bootstrap_size_ratio: 1 adapt_bootstrap_size_ratio_for_maximum_training_duration: false sampling_with_replacement: true } [INFO 23-11-20 12:31:32.8826 UTC kernel.cc:825] Deployment config: cache_path: "/tmpfs/tmp/tmpp9alip3z/working_cache" num_threads: 32 try_resume_training: true [INFO 23-11-20 12:31:32.8828 UTC kernel.cc:887] Train model [INFO 23-11-20 12:31:32.8836 UTC random_forest.cc:416] Training random forest on 67349 example(s) and 1 feature(s). 
[INFO 23-11-20 12:32:02.2437 UTC random_forest.cc:802] Training of tree 1/30 (tree index:13) done accuracy:0.738731 logloss:9.4171 [INFO 23-11-20 12:32:12.3428 UTC random_forest.cc:802] Training of tree 3/30 (tree index:27) done accuracy:0.754745 logloss:6.47525 [INFO 23-11-20 12:32:17.6546 UTC random_forest.cc:802] Training of tree 13/30 (tree index:20) done accuracy:0.801813 logloss:2.334 [INFO 23-11-20 12:32:18.5584 UTC random_forest.cc:802] Training of tree 23/30 (tree index:15) done accuracy:0.81742 logloss:0.942096 [INFO 23-11-20 12:32:21.9457 UTC random_forest.cc:802] Training of tree 30/30 (tree index:21) done accuracy:0.821274 logloss:0.854486 [INFO 23-11-20 12:32:21.9462 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.821274 logloss:0.854486 [INFO 23-11-20 12:32:21.9558 UTC kernel.cc:919] Export model in log directory: /tmpfs/tmp/tmpp9alip3z with prefix d2f2a624a65443d5 [INFO 23-11-20 12:32:21.9870 UTC kernel.cc:937] Save model in resources [INFO 23-11-20 12:32:21.9901 UTC abstract_model.cc:881] Model self evaluation: Number of predictions (without weights): 67349 Number of predictions (with weights): 67349 Task: CLASSIFICATION Label: __LABEL Accuracy: 0.821274 CI95[W][0.818828 0.8237] LogLoss: : 0.854486 ErrorRate: : 0.178726 Default Accuracy: : 0.557826 Default LogLoss: : 0.686445 Default ErrorRate: : 0.442174 Confusion Table: truth\prediction 1 2 1 19593 10187 2 1850 35719 Total: 67349 [INFO 23-11-20 12:32:22.0155 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpp9alip3z/model/ with prefix d2f2a624a65443d5 [INFO 23-11-20 12:32:22.3248 UTC decision_forest.cc:660] Model loaded with 30 root(s), 43180 node(s), and 1 input feature(s). [INFO 23-11-20 12:32:22.3249 UTC abstract_model.cc:1344] Engine "RandomForestGeneric" built [INFO 23-11-20 12:32:22.3249 UTC kernel.cc:1061] Use fast generic engine Model trained in 0:00:49.561739 Compiling model... Model compiled. <keras.src.callbacks.History at 0x7fd79650ec70>
In the previous logs, note that sentence is a CATEGORICAL_SET feature.
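The same logs report that rare tokens were pruned from the dictionary (min_value_count=5 and max_number_of_unique_values=2000) and are treated as out-of-dictionary. The sketch below illustrates that idea in plain Python on toy data. The helper `build_dictionary`, the `<OOD>` marker and the thresholds are hypothetical, and the code does not reproduce TF-DF's actual implementation.

```python
from collections import Counter


def build_dictionary(token_lists, min_count, max_size):
    """Keep the most frequent tokens that appear at least `min_count` times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    frequent = [tok for tok, count in counts.most_common() if count >= min_count]
    return set(frequent[:max_size])


# Toy corpus of already tokenized sentences (made-up data).
corpus = [
    ["the", "film", "is", "great"],
    ["the", "film", "is", "dull"],
    ["the", "plot", "is", "great"],
]
vocab = build_dictionary(corpus, min_count=2, max_size=2)

# Tokens outside the dictionary are mapped to a single out-of-dictionary marker.
encoded = [[tok if tok in vocab else "<OOD>" for tok in tokens] for tokens in corpus]
print(sorted(vocab))
print(encoded[1])
```

Here "dull" and "plot" are dropped by the minimum count, while "film" and "great" are dropped by the size limit.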
The model is evaluated as usual:
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds)
print(f"BinaryCrossentropyloss: {evaluation[0]}")
print(f"Accuracy: {evaluation[1]}")
9/9 [==============================] - 1s 5ms/step - loss: 0.0000e+00 - accuracy: 0.7638 BinaryCrossentropyloss: 0.0 Accuracy: 0.7637614607810974
The training logs look as follows:
import matplotlib.pyplot as plt
logs = model_1.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Out-of-bag accuracy")
pass
More trees would probably be beneficial (I am sure of it because I tried :p).
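The out-of-bag accuracy plotted above is measured, for each tree, on training examples that were not drawn in that tree's bootstrap sample (the training configuration in the logs shows bootstrapping with replacement). As a rough NumPy illustration with toy sizes, unrelated to TF-DF's internals, sampling n examples with replacement leaves roughly a third of them out of the bag:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
num_examples = 10_000

# Draw a bootstrap sample: num_examples indices sampled with replacement.
bootstrap_indices = rng.integers(0, num_examples, size=num_examples)

# Examples that were never drawn are "out-of-bag" for this tree and can be
# used to evaluate it without a separate validation set.
in_bag = np.zeros(num_examples, dtype=bool)
in_bag[bootstrap_indices] = True
oob_fraction = 1.0 - in_bag.mean()

print(f"Out-of-bag fraction: {oob_fraction:.3f} (theory: about {np.exp(-1):.3f})")
```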
Use a pretrained text embedding
The previous example trained a Random Forest using raw text features. This example will use a pre-trained TF-Hub embedding to convert text features into a dense embedding, and then train a Random Forest on top of it. In this situation, the Random Forest will only "see" the numerical output of the embedding (i.e. it will not see the raw text).
In this experiment, you will use the Universal-Sentence-Encoder. Different pre-trained embeddings might be suited for different types of text (e.g. different languages, different tasks) but also for other types of structured features (e.g. images).
The embedding module can be applied in one of two places:
- During the dataset preparation.
- In the pre-processing stage of the model.
The second option is often preferable: Packaging the embedding in the model makes the model easier to use (and harder to misuse).
First install TF-Hub:
pip install --upgrade tensorflow-hub
Unlike before, you don't need to tokenize the text.
def prepare_dataset(example):
  label = (example["label"] + 1) // 2
  return {"sentence" : example["sentence"]}, label
train_ds = all_ds["train"].batch(100).map(prepare_dataset)
test_ds = all_ds["validation"].batch(100).map(prepare_dataset)
%set_cell_height 300
import tensorflow_hub as hub
# NNLM (https://tfhub.dev/google/nnlm-en-dim128/2) is also a good choice.
hub_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embedding = hub.KerasLayer(hub_url)
sentence = tf.keras.layers.Input(shape=(), name="sentence", dtype=tf.string)
embedded_sentence = embedding(sentence)
raw_inputs = {"sentence": sentence}
processed_inputs = {"embedded_sentence": embedded_sentence}
preprocessor = tf.keras.Model(inputs=raw_inputs, outputs=processed_inputs)
model_2 = tfdf.keras.RandomForestModel(
    preprocessing=preprocessor,
    num_trees=100)
model_2.fit(x=train_ds)
<IPython.core.display.Javascript object> Warning: The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus. WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus. Use /tmpfs/tmp/tmp2l8qenh8 as temporary training directory Reading training dataset... Training dataset read in 0:00:22.682140. Found 67349 examples. Training model... [INFO 23-11-20 12:33:16.6995 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp2l8qenh8/model/ with prefix a883bbf674954d64 Model trained in 0:00:14.090027 Compiling model... [INFO 23-11-20 12:33:18.4993 UTC decision_forest.cc:660] Model loaded with 100 root(s), 563552 node(s), and 512 input feature(s). [INFO 23-11-20 12:33:18.4994 UTC abstract_model.cc:1344] Engine "RandomForestOptPred" built [INFO 23-11-20 12:33:18.4996 UTC kernel.cc:1061] Use fast generic engine Model compiled. <keras.src.callbacks.History at 0x7fd690629e50>
model_2.compile(metrics=["accuracy"])
evaluation = model_2.evaluate(test_ds)
print(f"BinaryCrossentropyloss: {evaluation[0]}")
print(f"Accuracy: {evaluation[1]}")
9/9 [==============================] - 2s 18ms/step - loss: 0.0000e+00 - accuracy: 0.7798 BinaryCrossentropyloss: 0.0 Accuracy: 0.7798165082931519
Note that categorical sets represent text differently from a dense embedding, so it may be useful to use both strategies jointly.
Train a decision tree and neural network together
The previous example used a pre-trained Neural Network (NN) to process the text features before passing them to the Random Forest. This example will train both the Neural Network and the Random Forest from scratch.
TF-DF's Decision Forests do not back-propagate gradients (although this is the subject of ongoing research). Therefore, the training happens in two stages:
- Train the neural network as a standard classification task:
  example → [Normalize] → [Neural Network*] → [classification head] → prediction
  *: Training.
- Replace the Neural Network's head (the last layer and the soft-max) with a Random Forest. Train the Random Forest as usual:
  example → [Normalize] → [Neural Network] → [Random Forest*] → prediction
  *: Training.
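The two stages can be sketched on toy data with NumPy alone. This is only an illustration of the training order, under loud assumptions: the data is synthetic, the network is a single hidden layer trained by plain gradient descent, and a nearest-centroid classifier stands in for the Random Forest. None of it reflects how TF-DF or Keras implement training.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy binary dataset: two Gaussian blobs in 2D (made-up data).
n = 200
X = np.concatenate([rng.normal(-2.0, 1.0, size=(n, 2)),
                    rng.normal(2.0, 1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Network: X -> [hidden layer + ReLU] -> [classification head + sigmoid].
W1 = rng.normal(0.0, 0.5, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, size=(8, 1))
b2 = np.zeros(1)
learning_rate = 0.1


def hidden_features(inputs):
    """Output of the network body, i.e. the layer just before the head."""
    return np.maximum(inputs @ W1 + b1, 0.0)


# Stage 1: train the body and the head together with gradient descent.
for _ in range(200):
    H = hidden_features(X)
    p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)[:, 0]))
    d_logits = ((p - y) / len(y))[:, None]    # gradient of the mean log loss
    d_hidden = (d_logits @ W2.T) * (H > 0)    # back-propagate through the ReLU
    W2 -= learning_rate * (H.T @ d_logits)
    b2 -= learning_rate * d_logits.sum(axis=0)
    W1 -= learning_rate * (X.T @ d_hidden)
    b1 -= learning_rate * d_hidden.sum(axis=0)

# Stage 2: freeze the body, drop the head, and fit a second model on the
# hidden features. A nearest-centroid classifier stands in for the forest.
H = hidden_features(X)
centroids = np.stack([H[y == c].mean(axis=0) for c in (0, 1)])
distances = np.linalg.norm(H[:, None, :] - centroids[None, :, :], axis=2)
stage2_accuracy = (distances.argmin(axis=1) == y).mean()
print(f"Stage-2 accuracy on the toy data: {stage2_accuracy:.3f}")
```

No gradient flows from the second model back into the network: the body's weights are fixed once stage 1 ends.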
Prepare the dataset
This example uses the Palmer's Penguins dataset. See the Beginner colab for details.
First, download the raw data:
wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv
Load a dataset into a Pandas Dataframe.
dataset_df = pd.read_csv("/tmp/penguins.csv")
# Display the first 3 examples.
dataset_df.head(3)
Prepare the dataset for training.
label = "species"
# Replaces numerical NaN (representing missing values in Pandas Dataframe) with 0s.
# ...Neural Nets don't work well with numerical NaNs.
for col in dataset_df.columns:
  if dataset_df[col].dtype not in [str, object]:
    dataset_df[col] = dataset_df[col].fillna(0)
# Split the dataset into a training and testing dataset.
def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas dataframe in two."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]
train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))
# Convert the datasets into tensorflow datasets
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)
248 examples in training, 96 examples for testing.
Build the models
Next create the neural network model using Keras' functional style.
To keep the example simple this model only uses two inputs.
input_1 = tf.keras.Input(shape=(1,), name="bill_length_mm", dtype="float")
input_2 = tf.keras.Input(shape=(1,), name="island", dtype="string")
nn_raw_inputs = [input_1, input_2]
Use preprocessing layers to convert the raw inputs to inputs appropriate for the neural network.
# Normalization.
Normalization = tf.keras.layers.Normalization
CategoryEncoding = tf.keras.layers.CategoryEncoding
StringLookup = tf.keras.layers.StringLookup
values = train_ds_pd["bill_length_mm"].values[:, tf.newaxis]
input_1_normalizer = Normalization()
input_1_normalizer.adapt(values)
values = train_ds_pd["island"].values
input_2_indexer = StringLookup(max_tokens=32)
input_2_indexer.adapt(values)
input_2_onehot = CategoryEncoding(output_mode="binary", max_tokens=32)
normalized_input_1 = input_1_normalizer(input_1)
normalized_input_2 = input_2_onehot(input_2_indexer(input_2))
nn_processed_inputs = [normalized_input_1, normalized_input_2]
WARNING:tensorflow:max_tokens is deprecated, please use num_tokens instead. WARNING:tensorflow:max_tokens is deprecated, please use num_tokens instead.
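To make the preprocessing concrete, here is a rough NumPy sketch of what these layers compute, assuming default Keras behavior (standardization to zero mean and unit variance, and index 0 reserved for strings not seen during adaptation). The column values below are made up for illustration, and the code is a stand-in rather than the Keras implementation.

```python
import numpy as np

# Toy stand-ins for the two training columns (made-up values).
bill_length_mm = np.array([39.1, 46.5, 50.0, 41.1])
island = np.array(["Biscoe", "Dream", "Biscoe", "Torgersen"])

# Normalization: subtract the mean and divide by the standard deviation
# learned from the training data.
mean, std = bill_length_mm.mean(), bill_length_mm.std()
normalized = (bill_length_mm - mean) / std

# String lookup: map each known string to an integer index, reserving
# index 0 for strings that were not seen during adaptation.
vocabulary = ["Biscoe", "Dream", "Torgersen"]
index_of = {name: i + 1 for i, name in enumerate(vocabulary)}
indices = np.array([index_of.get(name, 0) for name in island])

# One-hot encoding into a fixed number of slots (32 in the cell above).
num_tokens = 32
one_hot = np.eye(num_tokens)[indices]

print(normalized.round(2))
print(indices)
print(one_hot.shape)
```

Concatenating the normalized value with the 32-slot encoding gives the 33-dimensional input seen by the first dense layer.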
Build the body of the neural network:
y = tf.keras.layers.Concatenate()(nn_processed_inputs)
y = tf.keras.layers.Dense(16, activation=tf.nn.relu6)(y)
last_layer = tf.keras.layers.Dense(8, activation=tf.nn.relu, name="last")(y)
# "3" for the three label classes. If it were a binary classification, the
# output dim would be 1.
# The classification head is stacked on top of the last layer of the body.
classification_output = tf.keras.layers.Dense(3)(last_layer)
nn_model = tf.keras.models.Model(nn_raw_inputs, classification_output)
This nn_model directly produces classification logits.
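Logits are unnormalized scores; a softmax turns them into class probabilities. The NumPy sketch below uses made-up logits for three classes and is only meant to illustrate that conversion, not the internals of Keras.

```python
import numpy as np


def softmax(logits):
    """Convert logits to probabilities, row by row."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for two examples and three classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
probabilities = softmax(logits)
predicted_class = probabilities.argmax(axis=-1)

print(probabilities.round(3))
print(predicted_class)
```

Because the softmax preserves ordering, the predicted class can be read directly from the largest logit.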
Next, create a decision forest model. This will operate on the high-level features that the neural network extracts in the last layer before the classification head.
# To reduce the risk of mistakes, group both the decision forest and the
# neural network in a single keras model.
nn_without_head = tf.keras.models.Model(inputs=nn_model.inputs, outputs=last_layer)
df_and_nn_model = tfdf.keras.RandomForestModel(preprocessing=nn_without_head)
Warning: The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus. WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus. Use /tmpfs/tmp/tmpzwv9a980 as temporary training directory
Train and evaluate the models
The model will be trained in two stages. First train the neural network with its own classification head:
%set_cell_height 300
nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
nn_model.fit(x=train_ds, validation_data=test_ds, epochs=10)
nn_model.summary()
<IPython.core.display.Javascript object> Epoch 1/10 /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/engine/functional.py:642: UserWarning: Input dict contained keys ['bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'] which did not match any model input. They will be ignored by the model. inputs = self._flatten_to_reference_inputs(inputs) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1700483606.110085 457876 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process. 1/1 [==============================] - 2s 2s/step - loss: 1.4043 - accuracy: 0.0161 - val_loss: 1.3502 - val_accuracy: 0.0208 Epoch 2/10 1/1 [==============================] - 0s 21ms/step - loss: 1.3963 - accuracy: 0.0161 - val_loss: 1.3436 - val_accuracy: 0.0208 Epoch 3/10 1/1 [==============================] - 0s 21ms/step - loss: 1.3885 - accuracy: 0.0161 - val_loss: 1.3371 - val_accuracy: 0.0208 Epoch 4/10 1/1 [==============================] - 0s 21ms/step - loss: 1.3809 - accuracy: 0.0161 - val_loss: 1.3305 - val_accuracy: 0.0208 Epoch 5/10 1/1 [==============================] - 0s 21ms/step - loss: 1.3733 - accuracy: 0.0161 - val_loss: 1.3241 - val_accuracy: 0.0208 Epoch 6/10 1/1 [==============================] - 0s 20ms/step - loss: 1.3658 - accuracy: 0.0161 - val_loss: 1.3177 - val_accuracy: 0.0208 Epoch 7/10 1/1 [==============================] - 0s 20ms/step - loss: 1.3584 - accuracy: 0.0161 - val_loss: 1.3113 - val_accuracy: 0.0208 Epoch 8/10 1/1 [==============================] - 0s 20ms/step - loss: 1.3511 - accuracy: 0.0081 - val_loss: 1.3050 - val_accuracy: 0.0208 Epoch 9/10 1/1 [==============================] - 0s 21ms/step - loss: 1.3440 - accuracy: 0.0081 - val_loss: 1.2988 - val_accuracy: 0.0208 Epoch 10/10 1/1 [==============================] - 0s 21ms/step - loss: 1.3369 - accuracy: 0.0121 - val_loss: 1.2927 - val_accuracy: 0.0312 Model: 
"model_1" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== island (InputLayer) [(None, 1)] 0 [] bill_length_mm (InputLayer [(None, 1)] 0 [] ) string_lookup (StringLooku (None, 1) 0 ['island[0][0]'] p) normalization (Normalizati (None, 1) 3 ['bill_length_mm[0][0]'] on) category_encoding (Categor (None, 32) 0 ['string_lookup[0][0]'] yEncoding) concatenate (Concatenate) (None, 33) 0 ['normalization[0][0]', 'category_encoding[0][0]'] dense (Dense) (None, 16) 544 ['concatenate[0][0]'] dense_1 (Dense) (None, 3) 51 ['dense[0][0]'] ================================================================================================== Total params: 598 (2.34 KB) Trainable params: 595 (2.32 KB) Non-trainable params: 3 (16.00 Byte) __________________________________________________________________________________________________
The neural network layers are shared between the two models. Now that the neural network is trained, the decision forest model is fit on the output of the trained neural network layers:
%set_cell_height 300
df_and_nn_model.fit(x=train_ds)
<IPython.core.display.Javascript object> Reading training dataset... Training dataset read in 0:00:00.293304. Found 248 examples. Training model... Model trained in 0:00:00.045032 Compiling model... Model compiled. [INFO 23-11-20 12:33:27.2559 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpzwv9a980/model/ with prefix 3397b294ee2f42a4 [INFO 23-11-20 12:33:27.2721 UTC decision_forest.cc:660] Model loaded with 300 root(s), 5280 node(s), and 7 input feature(s). [INFO 23-11-20 12:33:27.2721 UTC kernel.cc:1061] Use fast generic engine <keras.src.callbacks.History at 0x7fd6a07d5ac0>
Now evaluate the composed model:
df_and_nn_model.compile(metrics=["accuracy"])
print("Evaluation:", df_and_nn_model.evaluate(test_ds))
1/1 [==============================] - 0s 162ms/step - loss: 0.0000e+00 - accuracy: 0.9479 Evaluation: [0.0, 0.9479166865348816]
Compare it to the Neural Network alone:
print("Evaluation :", nn_model.evaluate(test_ds))
1/1 [==============================] - 0s 13ms/step - loss: 1.2927 - accuracy: 0.0312 Evaluation : [1.2926578521728516, 0.03125]