TensorFlow Addons Optimizers: ConditionalGradient


Overview

This notebook demonstrates how to use the Conditional Gradient optimizer from the Addons package.

ConditionalGradient

Constraining the parameters of a neural network has been shown to be beneficial in training because of the underlying regularization effects. Often, parameters are constrained via a soft penalty (which never guarantees constraint satisfaction) or via a projection operation (which can be computationally expensive). The conditional gradient (CG) optimizer, on the other hand, enforces the constraints strictly without the need for an expensive projection step. It works by minimizing a linear approximation of the objective within the constraint set. In this notebook, you will demonstrate the application of a Frobenius norm constraint via the CG optimizer on the MNIST dataset. CG is available in TensorFlow Addons as tfa.optimizers.ConditionalGradient. More details of the optimizer are available in the paper: https://arxiv.org/pdf/1803.06453.pdf
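To make the linear-approximation idea concrete, here is a minimal NumPy sketch of one conditional gradient (Frank-Wolfe) step over a Frobenius-norm ball. This is an illustration of the general technique, not the TFA implementation: the linear minimization oracle over the ball `||w||_F <= lam` is `s = -lam * grad / ||grad||_F`, and the next iterate is a convex combination of the current point and `s`, so it can never leave the ball.

```python
import numpy as np

def cg_step(w, grad, lam, lr):
    """One conditional gradient (Frank-Wolfe) step under ||w||_F <= lam.

    The linear minimization oracle over the Frobenius-norm ball is
    s = -lam * grad / ||grad||_F; the new iterate is a convex
    combination of w and s, so it stays inside the ball.
    """
    s = -lam * grad / (np.linalg.norm(grad) + 1e-12)  # LMO solution
    return (1 - lr) * w + lr * s                      # convex combination

# Toy problem: minimize f(w) = 0.5 * ||w - target||_F^2 s.t. ||w||_F <= 1.
target = np.array([[3.0, 0.0], [0.0, 4.0]])  # ||target||_F = 5, outside the ball
w = np.zeros((2, 2))
for t in range(1, 200):
    grad = w - target                         # gradient of the quadratic
    w = cg_step(w, grad, lam=1.0, lr=2.0 / (t + 2))  # classic FW step size

print(np.linalg.norm(w))  # never exceeds the constraint radius of 1
```

Note that, unlike a projection step, no iterate ever needs to be pulled back into the feasible set; feasibility is maintained by construction.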

Setup

pip install -q -U tensorflow-addons
import tensorflow as tf
import tensorflow_addons as tfa
from matplotlib import pyplot as plt
# Hyperparameters
batch_size=64
epochs=10

Build the Model

model_1 = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(784,), activation='relu', name='dense_1'),
    tf.keras.layers.Dense(64, activation='relu', name='dense_2'),
    tf.keras.layers.Dense(10, activation='softmax', name='predictions'),
])

Prep the Data

# Load MNIST dataset as NumPy arrays
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Preprocess the data
x_train = x_train.reshape(-1, 784).astype('float32') / 255
x_test = x_test.reshape(-1, 784).astype('float32') / 255
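The reshape/rescale step above can be checked in isolation. The sketch below uses randomly generated stand-in arrays (not the real MNIST data) to confirm that each 28x28 image becomes a length-784 float vector with values in [0, 1]:

```python
import numpy as np

# Stand-in for a small batch of MNIST images (uint8 pixels in 0..255).
fake_images = np.random.randint(0, 256, size=(5, 28, 28)).astype('uint8')

# Same preprocessing as in the notebook: flatten and rescale to [0, 1].
flat = fake_images.reshape(-1, 784).astype('float32') / 255

print(flat.shape)  # (5, 784)
print(float(flat.min()) >= 0.0 and float(flat.max()) <= 1.0)  # True
```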

Define a Custom Callback Function

def frobenius_norm(m):
    """Computes the Frobenius norm over a list of weight tensors.

    Args:
        m: A list of weight tensors, one per layer.
    """
    total_reduce_sum = 0
    for w in m:
        total_reduce_sum = total_reduce_sum + tf.math.reduce_sum(w**2)
    norm = total_reduce_sum**0.5
    return norm
CG_frobenius_norm_of_weight = []
CG_get_weight_norm = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: CG_frobenius_norm_of_weight.append(
        frobenius_norm(model_1.trainable_weights).numpy()))
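As a quick sanity check, independent of the model, the quantity this callback records is the ordinary 2-norm of all weight entries concatenated into a single vector. The NumPy sketch below (with small made-up arrays) verifies that identity:

```python
import numpy as np

# Two made-up "layer weight" arrays of different shapes.
weights = [np.arange(6.0).reshape(2, 3), np.ones((3, 1))]

# Frobenius norm over the list, computed the same way as frobenius_norm().
norm_a = np.sqrt(sum(np.sum(w**2) for w in weights))

# Equivalent: 2-norm of all entries flattened into one vector.
stacked = np.concatenate([w.ravel() for w in weights])
norm_b = np.linalg.norm(stacked)

print(np.isclose(norm_a, norm_b))  # True
```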

Train and Evaluate: Using CG as Optimizer

Simply replace a typical Keras optimizer with the new TFA optimizer:

# Compile the model
model_1.compile(
    optimizer=tfa.optimizers.ConditionalGradient(
        learning_rate=0.99949, lambda_=203),  # Utilize TFA optimizer
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy'])

history_cg = model_1.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    validation_data=(x_test, y_test),
    epochs=epochs,
    callbacks=[CG_get_weight_norm])
Epoch 1/10
938/938 [==============================] - 3s 3ms/step - loss: 0.3728 - accuracy: 0.8873 - val_loss: 0.2127 - val_accuracy: 0.9369
Epoch 2/10
938/938 [==============================] - 3s 3ms/step - loss: 0.1892 - accuracy: 0.9425 - val_loss: 0.1689 - val_accuracy: 0.9498
Epoch 3/10
938/938 [==============================] - 3s 3ms/step - loss: 0.1490 - accuracy: 0.9545 - val_loss: 0.1471 - val_accuracy: 0.9557
Epoch 4/10
938/938 [==============================] - 3s 3ms/step - loss: 0.1322 - accuracy: 0.9603 - val_loss: 0.1198 - val_accuracy: 0.9627
Epoch 5/10
938/938 [==============================] - 2s 3ms/step - loss: 0.1215 - accuracy: 0.9632 - val_loss: 0.1224 - val_accuracy: 0.9626
Epoch 6/10
938/938 [==============================] - 2s 3ms/step - loss: 0.1153 - accuracy: 0.9661 - val_loss: 0.0946 - val_accuracy: 0.9719
Epoch 7/10
938/938 [==============================] - 2s 2ms/step - loss: 0.1106 - accuracy: 0.9668 - val_loss: 0.0992 - val_accuracy: 0.9689
Epoch 8/10
938/938 [==============================] - 2s 2ms/step - loss: 0.1074 - accuracy: 0.9673 - val_loss: 0.1309 - val_accuracy: 0.9628
Epoch 9/10
938/938 [==============================] - 2s 2ms/step - loss: 0.1045 - accuracy: 0.9685 - val_loss: 0.1162 - val_accuracy: 0.9625
Epoch 10/10
938/938 [==============================] - 2s 2ms/step - loss: 0.1035 - accuracy: 0.9686 - val_loss: 0.1110 - val_accuracy: 0.9672

Train and Evaluate: Using SGD as Optimizer

model_2 = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(784,), activation='relu', name='dense_1'),
    tf.keras.layers.Dense(64, activation='relu', name='dense_2'),
    tf.keras.layers.Dense(10, activation='softmax', name='predictions'),
])
SGD_frobenius_norm_of_weight = []
SGD_get_weight_norm = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: SGD_frobenius_norm_of_weight.append(
        frobenius_norm(model_2.trainable_weights).numpy()))
# Compile the model
model_2.compile(
    optimizer=tf.keras.optimizers.SGD(0.01),  # Utilize SGD optimizer
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy'])

history_sgd = model_2.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    validation_data=(x_test, y_test),
    epochs=epochs,
    callbacks=[SGD_get_weight_norm])
Epoch 1/10
938/938 [==============================] - 2s 2ms/step - loss: 1.0540 - accuracy: 0.7220 - val_loss: 0.4450 - val_accuracy: 0.8829
Epoch 2/10
938/938 [==============================] - 2s 2ms/step - loss: 0.3895 - accuracy: 0.8908 - val_loss: 0.3269 - val_accuracy: 0.9049
Epoch 3/10
938/938 [==============================] - 2s 2ms/step - loss: 0.3191 - accuracy: 0.9086 - val_loss: 0.2838 - val_accuracy: 0.9187
Epoch 4/10
938/938 [==============================] - 2s 2ms/step - loss: 0.2843 - accuracy: 0.9187 - val_loss: 0.2608 - val_accuracy: 0.9252
Epoch 5/10
938/938 [==============================] - 2s 2ms/step - loss: 0.2602 - accuracy: 0.9255 - val_loss: 0.2430 - val_accuracy: 0.9294
Epoch 6/10
938/938 [==============================] - 2s 2ms/step - loss: 0.2415 - accuracy: 0.9304 - val_loss: 0.2279 - val_accuracy: 0.9333
Epoch 7/10
938/938 [==============================] - 2s 2ms/step - loss: 0.2261 - accuracy: 0.9346 - val_loss: 0.2142 - val_accuracy: 0.9379
Epoch 8/10
938/938 [==============================] - 2s 2ms/step - loss: 0.2129 - accuracy: 0.9386 - val_loss: 0.2052 - val_accuracy: 0.9401
Epoch 9/10
938/938 [==============================] - 2s 2ms/step - loss: 0.2013 - accuracy: 0.9418 - val_loss: 0.1950 - val_accuracy: 0.9440
Epoch 10/10
938/938 [==============================] - 2s 2ms/step - loss: 0.1910 - accuracy: 0.9445 - val_loss: 0.1882 - val_accuracy: 0.9461

Frobenius Norm of Weights: CG vs SGD

The current implementation of the CG optimizer is based on the Frobenius norm, treating the Frobenius norm as a regularizer in the target function. Therefore, you can compare CG's regularization effect with that of the SGD optimizer, which does not impose a Frobenius norm regularizer.

plt.plot(
    CG_frobenius_norm_of_weight,
    color='r',
    label='CG_frobenius_norm_of_weights')
plt.plot(
    SGD_frobenius_norm_of_weight,
    color='b',
    label='SGD_frobenius_norm_of_weights')
plt.xlabel('Epoch')
plt.ylabel('Frobenius norm of weights')
plt.legend(loc=1)
[Plot: Frobenius norm of weights vs. epoch for CG and SGD]

Train and Validation Accuracy: CG vs SGD

plt.plot(history_cg.history['accuracy'], color='r', label='CG_train')
plt.plot(history_cg.history['val_accuracy'], color='g', label='CG_test')
plt.plot(history_sgd.history['accuracy'], color='pink', label='SGD_train')
plt.plot(history_sgd.history['val_accuracy'], color='b', label='SGD_test')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc=4)
[Plot: train and validation accuracy vs. epoch for CG and SGD]