Quantization aware training comprehensive guide


Welcome to the comprehensive guide for Keras quantization aware training.

This page documents various use cases and shows how to use the API for each one. Once you know which APIs you need, find the parameters and the low-level details in the API docs.

The following use cases are covered:

  • Deploy a model with 8-bit quantization with these steps.
    • Define a quantization aware model.
    • For Keras HDF5 models only, use special checkpointing and deserialization logic. Training is otherwise standard.
    • Create a quantized model from the quantization aware one.
  • Experiment with quantization.
    • Anything used for experimentation has no supported path to deployment.
    • Custom Keras layers fall under experimentation.

Setup

If you only want to find the APIs you need and understand what they are for, you can run this section without reading it.

! pip uninstall -y tensorflow
! pip install -q tf-nightly
! pip install -q tensorflow-model-optimization

import tensorflow as tf
import numpy as np
import tensorflow_model_optimization as tfmot

import tempfile

input_shape = [20]
x_train = np.random.randn(1, 20).astype(np.float32)
y_train = tf.keras.utils.to_categorical(np.random.randn(1), num_classes=20)

def setup_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(20, input_shape=input_shape),
      tf.keras.layers.Flatten()
  ])
  return model

def setup_pretrained_weights():
  model = setup_model()

  model.compile(
      loss=tf.keras.losses.categorical_crossentropy,
      optimizer='adam',
      metrics=['accuracy']
  )

  model.fit(x_train, y_train)

  _, pretrained_weights = tempfile.mkstemp('.tf')

  model.save_weights(pretrained_weights)

  return pretrained_weights

def setup_pretrained_model():
  model = setup_model()
  pretrained_weights = setup_pretrained_weights()
  model.load_weights(pretrained_weights)
  return model

setup_model()
pretrained_weights = setup_pretrained_weights()

Define quantization aware model

When you define models in the following ways, there are available paths to deployment on the backends listed in the overview page. By default, 8-bit quantization is used.
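As a rough illustration of what 8-bit quantization means during training, the sketch below simulates a quantize-then-dequantize ("fake quantization") step in plain Python. It is not the library's implementation: the function name `fake_quantize`, the fixed range of [-1, 1], and the rounding details are assumptions chosen for illustration.

```python
def fake_quantize(values, min_val, max_val, num_bits=8):
    """Snap floats onto a uniform grid of 2**num_bits levels (illustrative only)."""
    levels = 2 ** num_bits - 1              # 255 steps for 8 bits
    scale = (max_val - min_val) / levels    # width of one step
    result = []
    for v in values:
        clipped = min(max(v, min_val), max_val)   # values outside the range saturate
        q = round((clipped - min_val) / scale)    # integer code in [0, levels]
        result.append(min_val + q * scale)        # map the code back to a float
    return result

weights = [-1.7, -0.333, 0.0, 0.25, 0.9]
approx = fake_quantize(weights, min_val=-1.0, max_val=1.0)

# Error for the values that were inside the range (everything except -1.7).
max_in_range_error = max(abs(a - w) for a, w in zip(approx[1:], weights[1:]))
print(approx)
print(max_in_range_error)
```

In this sketch the out-of-range value saturates at the range boundary, and every in-range value moves by at most half a step. Training against such rounded values is what lets a model adapt to the precision it will have after conversion.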

Quantize whole model

Your use case:

  • Subclassed models are not supported.

Tips for better model accuracy:

  • Try "Quantize some layers" to skip quantizing the layers that reduce accuracy the most.
  • It is generally better to fine-tune with quantization aware training than to train from scratch.

To make the whole model aware of quantization, apply tfmot.quantization.keras.quantize_model to the model.

base_model = setup_model()
base_model.load_weights(pretrained_weights) # optional but recommended for model accuracy

quant_aware_model = tfmot.quantization.keras.quantize_model(base_model)
quant_aware_model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer (QuantizeLaye (None, 20)                3         
_________________________________________________________________
quant_dense_2 (QuantizeWrapp (None, 20)                425       
_________________________________________________________________
quant_flatten_2 (QuantizeWra (None, 20)                1         
=================================================================
Total params: 429
Trainable params: 420
Non-trainable params: 9
_________________________________________________________________
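The parameter counts in the summary above can be reproduced by hand. The breakdown below is an interpretation rather than something the summary itself states: it assumes that each quantizer stores a scalar min and max, and that each quantization layer or wrapper stores one step counter, all of them non-trainable.

```python
# Trainable parameters of Dense(20) applied to a 20-feature input.
kernel = 20 * 20
bias = 20
trainable = kernel + bias                    # 420

# Assumed non-trainable bookkeeping variables (all scalars).
quantize_layer = 2 + 1                       # input min/max + step counter
quant_dense_extra = 2 + 2 + 1                # kernel min/max, activation min/max, step counter
quant_flatten = 1                            # step counter only

per_layer = {
    "quantize_layer": quantize_layer,
    "quant_dense": trainable + quant_dense_extra,
    "quant_flatten": quant_flatten,
}
non_trainable = quantize_layer + quant_dense_extra + quant_flatten
total = sum(per_layer.values())
print(per_layer, total, trainable, non_trainable)
```

Under these assumptions the numbers line up with the summary: 3, 425, and 1 parameters per layer, 429 in total, of which 9 are non-trainable.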

Quantize some layers

Quantizing a model can have a negative effect on accuracy. You can selectively quantize layers of a model to explore the trade-off between accuracy, speed, and model size.

Your use case:

  • To deploy to a backend that only works well with fully quantized models (e.g. EdgeTPU v1, most DSPs), try "Quantize whole model".

Tips for better model accuracy:

  • It is generally better to fine-tune with quantization aware training than to train from scratch.
  • Try quantizing the later layers instead of the first layers.
  • Avoid quantizing critical layers (e.g. attention mechanism).

In the example below, quantize only the Dense layers.

# Create a base model
base_model = setup_model()
base_model.load_weights(pretrained_weights) # optional but recommended for model accuracy

# Helper function uses `quantize_annotate_layer` to annotate that only the 
# Dense layers should be quantized.
def apply_quantization_to_dense(layer):
  if isinstance(layer, tf.keras.layers.Dense):
    return tfmot.quantization.keras.quantize_annotate_layer(layer)
  return layer

# Use `tf.keras.models.clone_model` to apply `apply_quantization_to_dense` 
# to the layers of the model.
annotated_model = tf.keras.models.clone_model(
    base_model,
    clone_function=apply_quantization_to_dense,
)

# Now that the Dense layers are annotated,
# `quantize_apply` actually makes the model quantization aware.
quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)
quant_aware_model.summary()
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.decay
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.learning_rate
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.bias
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).layer_with_weights-0.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).layer_with_weights-0.bias
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer_1 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_dense_3 (QuantizeWrapp (None, 20)                425       
_________________________________________________________________
flatten_3 (Flatten)          (None, 20)                0         
=================================================================
Total params: 428
Trainable params: 420
Non-trainable params: 8
_________________________________________________________________

While this example used the type of the layer to decide what to quantize, the easiest way to quantize a particular layer is to set its name property and look for that name in the clone_function.

print(base_model.layers[0].name)
dense_3
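The name-based selection described above can be sketched as follows. The helper `make_name_based_clone_function` and the stand-in layer objects are hypothetical and exist only so that the snippet runs without TensorFlow. In the setting of this guide, `annotate_fn` would be `tfmot.quantization.keras.quantize_annotate_layer`, and the returned function would be passed as `clone_function` to `tf.keras.models.clone_model`.

```python
from types import SimpleNamespace

def make_name_based_clone_function(names_to_quantize, annotate_fn):
    """Return a clone_function that annotates only layers whose name is listed."""
    names = set(names_to_quantize)

    def clone_function(layer):
        if layer.name in names:
            return annotate_fn(layer)
        return layer  # leave every other layer untouched

    return clone_function

# Stand-ins for Keras layers and for quantize_annotate_layer (assumptions for the demo).
layers = [SimpleNamespace(name="dense_3"), SimpleNamespace(name="flatten_3")]
fake_annotate = lambda layer: SimpleNamespace(name="annotated_" + layer.name)

clone_fn = make_name_based_clone_function({"dense_3"}, fake_annotate)
result_names = [clone_fn(layer).name for layer in layers]
print(result_names)
```

Auto-generated names such as `dense_3` depend on how many layers were created earlier in the session, so giving the layer an explicit `name=` when it is constructed makes this lookup more predictable.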

More readable but potentially lower model accuracy

This approach is not compatible with fine-tuning with quantization aware training, which is why it may be less accurate than the examples above.

Functional example

# Use `quantize_annotate_layer` to annotate that the `Dense` layer
# should be quantized.
i = tf.keras.Input(shape=(20,))
x = tfmot.quantization.keras.quantize_annotate_layer(tf.keras.layers.Dense(10))(i)
o = tf.keras.layers.Flatten()(x)
annotated_model = tf.keras.Model(inputs=i, outputs=o)

# Use `quantize_apply` to actually make the model quantization aware.
quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)

# For deployment purposes, the tool adds `QuantizeLayer` after `InputLayer` so that the
# quantized model can take in float inputs instead of only uint8.
quant_aware_model.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 20)]              0         
_________________________________________________________________
quantize_layer_2 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_dense_4 (QuantizeWrapp (None, 10)                215       
_________________________________________________________________
flatten_4 (Flatten)          (None, 10)                0         
=================================================================
Total params: 218
Trainable params: 210
Non-trainable params: 8
_________________________________________________________________

Sequential example

# Use `quantize_annotate_layer` to annotate that the `Dense` layer
# should be quantized.
annotated_model = tf.keras.Sequential([
  tfmot.quantization.keras.quantize_annotate_layer(tf.keras.layers.Dense(20, input_shape=input_shape)),
  tf.keras.layers.Flatten()
])

# Use `quantize_apply` to actually make the model quantization aware.
quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)

quant_aware_model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer_3 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_dense_5 (QuantizeWrapp (None, 20)                425       
_________________________________________________________________
flatten_5 (Flatten)          (None, 20)                0         
=================================================================
Total params: 428
Trainable params: 420
Non-trainable params: 8
_________________________________________________________________

Checkpoint and deserialize

Your use case: this code is only needed for the HDF5 model format (not HDF5 weights or other formats).

# Define the model.
base_model = setup_model()
base_model.load_weights(pretrained_weights) # optional but recommended for model accuracy
quant_aware_model = tfmot.quantization.keras.quantize_model(base_model)

# Save or checkpoint the model.
_, keras_model_file = tempfile.mkstemp('.h5')
quant_aware_model.save(keras_model_file)

# `quantize_scope` is needed for deserializing HDF5 models.
with tfmot.quantization.keras.quantize_scope():
  loaded_model = tf.keras.models.load_model(keras_model_file)

loaded_model.summary()
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.decay
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.learning_rate
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.bias
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).layer_with_weights-0.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).layer_with_weights-0.bias
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
WARNING:tensorflow:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer_4 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_dense_6 (QuantizeWrapp (None, 20)                425       
_________________________________________________________________
quant_flatten_6 (QuantizeWra (None, 20)                1         
=================================================================
Total params: 429
Trainable params: 420
Non-trainable params: 9
_________________________________________________________________

Create and deploy quantized model

In general, refer to the documentation for the deployment backend that you will use.

This is an example for the TFLite backend.

base_model = setup_pretrained_model()
quant_aware_model = tfmot.quantization.keras.quantize_model(base_model)

# Typically you train the model here.

converter = tf.lite.TFLiteConverter.from_keras_model(quant_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

quantized_tflite_model = converter.convert()
1/1 [==============================] - 0s 229ms/step - loss: 16.1181 - accuracy: 0.0000e+00
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.decay
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.learning_rate
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.bias
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).layer_with_weights-0.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).layer_with_weights-0.bias
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
WARNING:absl:Found untraced functions such as dense_7_layer_call_and_return_conditional_losses, dense_7_layer_call_fn, flatten_7_layer_call_and_return_conditional_losses, flatten_7_layer_call_fn, dense_7_layer_call_fn while saving (showing 5 of 10). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /tmp/tmph82hcqfi/assets
INFO:tensorflow:Assets written to: /tmp/tmph82hcqfi/assets
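A back-of-the-envelope estimate shows where the size reduction of an 8-bit model comes from. The figures below are arithmetic on this guide's toy Dense(20) layer, not measurements of the converted file. They assume 8-bit kernel weights and 32-bit bias values, and they ignore any file-format overhead.

```python
kernel_weights = 20 * 20
bias_values = 20

# Every parameter stored as a 4-byte float32.
float32_bytes = (kernel_weights + bias_values) * 4

# Assumption: 1 byte per kernel weight, 4 bytes per bias value.
quantized_bytes = kernel_weights * 1 + bias_values * 4

ratio = float32_bytes / quantized_bytes
print(float32_bytes, quantized_bytes, round(ratio, 2))
```

For a model this small, fixed overhead in the serialized file is likely to dominate, so the ratio is only indicative. It approaches 4x as the share of kernel weights grows.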

Experiment with quantization

Your use case: using the following APIs means that there is no supported path to deployment. For instance, TFLite conversion and kernel implementations only support 8-bit quantization. The features are also experimental and not subject to backward compatibility.

Setup: DefaultDenseQuantizeConfig

Experimenting requires using tfmot.quantization.keras.QuantizeConfig, which describes how to quantize the weights, activations, and outputs of a layer.

Below is an example that defines the same QuantizeConfig used for the Dense layer in the API defaults.

During the forward propagation in this example, the LastValueQuantizer returned in get_weights_and_quantizers is called with layer.kernel as the input, producing an output. The output replaces layer.kernel in the original forward propagation of the Dense layer, via the logic defined in set_quantize_weights. The same idea applies to the activations and outputs.

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

class DefaultDenseQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    # Configure how to quantize weights.
    def get_weights_and_quantizers(self, layer):
      return [(layer.kernel, LastValueQuantizer(num_bits=8, symmetric=True, narrow_range=False, per_axis=False))]

    # Configure how to quantize activations.
    def get_activations_and_quantizers(self, layer):
      return [(layer.activation, MovingAverageQuantizer(num_bits=8, symmetric=False, narrow_range=False, per_axis=False))]

    def set_quantize_weights(self, layer, quantize_weights):
      # Add this line for each item returned in `get_weights_and_quantizers`,
      # in the same order.
      layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
      # Add this line for each item returned in `get_activations_and_quantizers`,
      # in the same order.
      layer.activation = quantize_activations[0]

    # Configure how to quantize outputs (may be equivalent to activations).
    def get_output_quantizers(self, layer):
      return []

    def get_config(self):
      return {}
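The two quantizers used in this config differ in how they choose the range to quantize into. The sketch below illustrates the general idea in plain Python: a last-value tracker uses the min and max of the most recent tensor, while a moving-average tracker smooths them across batches. The class names, the decay of 0.9, and the initial range are assumptions for illustration and do not reflect the library's internals.

```python
class LastValueRange:
    """Range is simply the min/max of the latest tensor seen."""

    def update(self, values):
        self.min, self.max = min(values), max(values)
        return self.min, self.max


class MovingAverageRange:
    """Range is an exponential moving average of per-batch min/max."""

    def __init__(self, init_min=-6.0, init_max=6.0, decay=0.9):
        self.min, self.max, self.decay = init_min, init_max, decay

    def update(self, values):
        d = self.decay
        self.min = d * self.min + (1 - d) * min(values)
        self.max = d * self.max + (1 - d) * max(values)
        return self.min, self.max


# The second batch contains an outlier (8.0).
batches = [[-1.0, 0.5, 2.0], [-0.5, 0.1, 8.0], [-1.2, 0.3, 1.9]]
last, avg = LastValueRange(), MovingAverageRange()
for batch in batches:
    last_range = last.update(batch)
    avg_range = avg.update(batch)
print(last_range, avg_range)
```

The outlier immediately becomes the upper bound of the last-value range for that step, but it only nudges the moving average. One plausible reading of the config above is that weights are fully known at each step, so their latest range is appropriate, while activations vary from batch to batch and benefit from smoothing.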

Quantize custom Keras layer

This example uses the DefaultDenseQuantizeConfig to quantize the CustomLayer.

Applying the configuration is the same across the "Experiment with quantization" use cases.

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_scope = tfmot.quantization.keras.quantize_scope

class CustomLayer(tf.keras.layers.Dense):
  pass

model = quantize_annotate_model(tf.keras.Sequential([
   quantize_annotate_layer(CustomLayer(20, input_shape=(20,)), DefaultDenseQuantizeConfig()),
   tf.keras.layers.Flatten()
]))

# `quantize_apply` requires mentioning `DefaultDenseQuantizeConfig` with `quantize_scope`
# as well as the custom Keras layer.
with quantize_scope(
  {'DefaultDenseQuantizeConfig': DefaultDenseQuantizeConfig,
   'CustomLayer': CustomLayer}):
  # Use `quantize_apply` to actually make the model quantization aware.
  quant_aware_model = tfmot.quantization.keras.quantize_apply(model)

quant_aware_model.summary()
Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer_6 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_custom_layer (Quantize (None, 20)                425       
_________________________________________________________________
quant_flatten_9 (QuantizeWra (None, 20)                1         
=================================================================
Total params: 429
Trainable params: 420
Non-trainable params: 9
_________________________________________________________________

Modify quantization parameters

Common mistake: quantizing the bias to fewer than 32 bits usually harms model accuracy too much.

This example modifies the Dense layer to use 4 bits for its weights instead of the default 8 bits. The rest of the model continues to use the API defaults.

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_scope = tfmot.quantization.keras.quantize_scope

class ModifiedDenseQuantizeConfig(DefaultDenseQuantizeConfig):
    # Configure weights to quantize with 4-bit instead of 8-bits.
    def get_weights_and_quantizers(self, layer):
      return [(layer.kernel, LastValueQuantizer(num_bits=4, symmetric=True, narrow_range=False, per_axis=False))]
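To get a feel for what dropping from 8 to 4 bits costs, the sketch below compares the grid spacing and worst-case rounding error of a symmetric quantizer at both widths. The helper `symmetric_fake_quantize` is hypothetical and only approximates the idea. The library's quantizer may differ in rounding and range handling.

```python
def symmetric_fake_quantize(values, num_bits):
    """Quantize-dequantize onto a grid symmetric around zero (illustrative only)."""
    max_abs = max(abs(v) for v in values)
    q_max = 2 ** (num_bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    scale = max_abs / q_max                  # spacing between neighbouring grid points
    return [round(v / scale) * scale for v in values], scale

weights = [-0.8, -0.31, 0.02, 0.27, 0.64]
errors = {}
for bits in (8, 4):
    approx, scale = symmetric_fake_quantize(weights, bits)
    errors[bits] = max(abs(a - w) for a, w in zip(approx, weights))
    print(bits, "bits: step", scale, "max error", errors[bits])
```

In this sketch each removed bit roughly doubles the step size, so the 4-bit grid is about 18 times coarser than the 8-bit one (127 / 7). This is why lower bit widths are treated as experimental here.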

Applying the configuration is the same across the "Experiment with quantization" use cases.

model = quantize_annotate_model(tf.keras.Sequential([
   # Pass in modified `QuantizeConfig` to modify this Dense layer.
   quantize_annotate_layer(tf.keras.layers.Dense(20, input_shape=(20,)), ModifiedDenseQuantizeConfig()),
   tf.keras.layers.Flatten()
]))

# `quantize_apply` requires mentioning `ModifiedDenseQuantizeConfig` with `quantize_scope`:
with quantize_scope(
  {'ModifiedDenseQuantizeConfig': ModifiedDenseQuantizeConfig}):
  # Use `quantize_apply` to actually make the model quantization aware.
  quant_aware_model = tfmot.quantization.keras.quantize_apply(model)

quant_aware_model.summary()
Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer_7 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_dense_9 (QuantizeWrapp (None, 20)                425       
_________________________________________________________________
quant_flatten_10 (QuantizeWr (None, 20)                1         
=================================================================
Total params: 429
Trainable params: 420
Non-trainable params: 9
_________________________________________________________________

Modify parts of layer to quantize

This example modifies the Dense layer to skip quantizing the activation. The rest of the model continues to use the API defaults.

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_scope = tfmot.quantization.keras.quantize_scope

class ModifiedDenseQuantizeConfig(DefaultDenseQuantizeConfig):
    def get_activations_and_quantizers(self, layer):
      # Skip quantizing activations.
      return []

    def set_quantize_activations(self, layer, quantize_activations):
      # Empty since `get_activations_and_quantizers` returns
      # an empty list.
      return

Applying the configuration is the same across the "Experiment with quantization" use cases.

model = quantize_annotate_model(tf.keras.Sequential([
   # Pass in modified `QuantizeConfig` to modify this Dense layer.
   quantize_annotate_layer(tf.keras.layers.Dense(20, input_shape=(20,)), ModifiedDenseQuantizeConfig()),
   tf.keras.layers.Flatten()
]))

# `quantize_apply` requires mentioning `ModifiedDenseQuantizeConfig` with `quantize_scope`:
with quantize_scope(
  {'ModifiedDenseQuantizeConfig': ModifiedDenseQuantizeConfig}):
  # Use `quantize_apply` to actually make the model quantization aware.
  quant_aware_model = tfmot.quantization.keras.quantize_apply(model)

quant_aware_model.summary()
Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer_8 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_dense_10 (QuantizeWrap (None, 20)                423       
_________________________________________________________________
quant_flatten_11 (QuantizeWr (None, 20)                1         
=================================================================
Total params: 427
Trainable params: 420
Non-trainable params: 7
_________________________________________________________________

Use custom quantization algorithm

The tfmot.quantization.keras.quantizers.Quantizer class is a callable that can apply any algorithm to its inputs.

In this example, the inputs are the weights, and we apply the math in the FixedRangeQuantizer __call__ function to the weights. Instead of the original weight values, the output of the FixedRangeQuantizer is now passed to whatever would have used the weights.

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_scope = tfmot.quantization.keras.quantize_scope

class FixedRangeQuantizer(tfmot.quantization.keras.quantizers.Quantizer):
  """Quantizer which forces outputs to be between -1 and 1."""

  def build(self, tensor_shape, name, layer):
    # Not needed. No new TensorFlow variables needed.
    return {}

  def __call__(self, inputs, training, weights, **kwargs):
    return tf.keras.backend.clip(inputs, -1.0, 1.0)

  def get_config(self):
    # Not needed. No __init__ parameters to serialize.
    return {}


class ModifiedDenseQuantizeConfig(DefaultDenseQuantizeConfig):
    # Configure weights to be quantized with a custom algorithm.
    def get_weights_and_quantizers(self, layer):
      # Use custom algorithm defined in `FixedRangeQuantizer` instead of default Quantizer.
      return [(layer.kernel, FixedRangeQuantizer())]
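The effect of the clipping in `FixedRangeQuantizer.__call__` on a kernel can be illustrated without TensorFlow. The function `fixed_range` below is a plain-Python stand-in written for this illustration, and the sample values are made up.

```python
def fixed_range(values, low=-1.0, high=1.0):
    """Plain-Python stand-in for clipping a tensor to [low, high]."""
    return [min(max(v, low), high) for v in values]

kernel = [-2.5, -0.75, 0.1, 0.99, 3.0]
clipped = fixed_range(kernel)
print(clipped)
```

Values outside the range saturate at the boundaries and values inside pass through unchanged. Unlike the bit-width based quantizers earlier in this guide, this example bounds the weights but does not round them onto a discrete grid.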

Applying the configuration is the same across the "Experiment with quantization" use cases.

model = quantize_annotate_model(tf.keras.Sequential([
   # Pass in modified `QuantizeConfig` to modify this `Dense` layer.
   quantize_annotate_layer(tf.keras.layers.Dense(20, input_shape=(20,)), ModifiedDenseQuantizeConfig()),
   tf.keras.layers.Flatten()
]))

# `quantize_apply` requires mentioning `ModifiedDenseQuantizeConfig` with `quantize_scope`:
with quantize_scope(
  {'ModifiedDenseQuantizeConfig': ModifiedDenseQuantizeConfig}):
  # Use `quantize_apply` to actually make the model quantization aware.
  quant_aware_model = tfmot.quantization.keras.quantize_apply(model)

quant_aware_model.summary()
Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantize_layer_9 (QuantizeLa (None, 20)                3         
_________________________________________________________________
quant_dense_11 (QuantizeWrap (None, 20)                423       
_________________________________________________________________
quant_flatten_12 (QuantizeWr (None, 20)                1         
=================================================================
Total params: 427
Trainable params: 420
Non-trainable params: 7
_________________________________________________________________