Weight clustering comprehensive guide

View on TensorFlow.org

Run in Google Colab

View source on GitHub

Download notebook

Welcome to the comprehensive guide for weight clustering, part of the TensorFlow Model Optimization toolkit.

This page documents various use cases and shows how to use the API for each one. Once you know which APIs you need, find the parameters and the low-level details in the API docs:

If you want to see the benefits of weight clustering and what's supported, check the overview.
For a single end-to-end example, see the weight clustering example.

In this guide, the following use cases are covered:

Define a clustered model.
Checkpoint and deserialize a clustered model.
Improve the accuracy of the clustered model.
For deployment only, you must take steps to see compression benefits.

Setup

! pip install -q tensorflow-model-optimization

import tensorflow as tf
import tf_keras as keras
import numpy as np
import tempfile
import os
import tensorflow_model_optimization as tfmot

input_dim = 20
output_dim = 20
x_train = np.random.randn(1, input_dim).astype(np.float32)
y_train = keras.utils.to_categorical(np.random.randn(1), num_classes=output_dim)

def setup_model():
  model = keras.Sequential([
      keras.layers.Dense(input_dim, input_shape=[input_dim]),
      keras.layers.Flatten()
  ])
  return model

def train_model(model):
  model.compile(
      loss=keras.losses.categorical_crossentropy,
      optimizer='adam',
      metrics=['accuracy']
  )
  model.summary()
  model.fit(x_train, y_train)
  return model

def save_model_weights(model):
  _, pretrained_weights = tempfile.mkstemp('.h5')
  model.save_weights(pretrained_weights)
  return pretrained_weights

def setup_pretrained_weights():
  model= setup_model()
  model = train_model(model)
  pretrained_weights = save_model_weights(model)
  return pretrained_weights

def setup_pretrained_model():
  model = setup_model()
  pretrained_weights = setup_pretrained_weights()
  model.load_weights(pretrained_weights)
  return model

def save_model_file(model):
  _, keras_file = tempfile.mkstemp('.h5') 
  model.save(keras_file, include_optimizer=False)
  return keras_file

def get_gzipped_model_size(model):
  # It returns the size of the gzipped model in bytes.
  import os
  import zipfile

  keras_file = save_model_file(model)

  _, zipped_file = tempfile.mkstemp('.zip')
  with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(keras_file)
  return os.path.getsize(zipped_file)

setup_model()
pretrained_weights = setup_pretrained_weights()

2025-10-11 13:50:35.548850: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1760190635.574605 44557 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760190635.583117 44557 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1760190635.602965 44557 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1760190635.602997 44557 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1760190635.603000 44557 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1760190635.603003 44557 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-10-11 13:50:39.127083: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Define a clustered model

Cluster a whole model (sequential and functional)

Tips for better model accuracy:

You must pass a pre-trained model with acceptable accuracy to this API. Training models from scratch with clustering results in subpar accuracy.
In some cases, clustering certain layers has a detrimental effect on model accuracy. Check "Cluster some layers" to see how to skip clustering the layers that affect accuracy the most.

To cluster all layers, apply tfmot.clustering.keras.cluster_weights to the model.

import tensorflow_model_optimization as tfmot

cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

clustering_params = {
  'number_of_clusters': 3,
  'cluster_centroids_init': CentroidInitialization.KMEANS_PLUS_PLUS
}

model = setup_model()
model.load_weights(pretrained_weights)

clustered_model = cluster_weights(model, **clustering_params)

clustered_model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 cluster_dense_2 (ClusterWe  (None, 20)                823       
 ights)                                                          
                                                                 
 cluster_flatten_2 (Cluster  (None, 20)                0         
 Weights)                                                        
                                                                 
=================================================================
Total params: 823 (4.78 KB)
Trainable params: 423 (1.65 KB)
Non-trainable params: 400 (3.12 KB)
_________________________________________________________________

Cluster some layers (sequential and functional models)

Tips for better model accuracy:

You must pass a pre-trained model with acceptable accuracy to this API. Training models from scratch with clustering results in subpar accuracy.
Cluster later layers with more redundant parameters (e.g. keras.layers.Dense, keras.layers.Conv2D), as opposed to the early layers.
Freeze early layers prior to the clustered layers during fine-tuning. Treat the number of frozen layers as a hyperparameter. Empirically, freezing most early layers is ideal for the current clustering API.
Avoid clustering critical layers (e.g. attention mechanism).

More: the tfmot.clustering.keras.cluster_weights API docs provide details on how to vary the clustering configuration per layer.

# Create a base model
base_model = setup_model()
base_model.load_weights(pretrained_weights)

# Helper function uses `cluster_weights` to make only 
# the Dense layers train with clustering
def apply_clustering_to_dense(layer):
  if isinstance(layer, keras.layers.Dense):
    return cluster_weights(layer, **clustering_params)
  return layer

# Use `keras.models.clone_model` to apply `apply_clustering_to_dense` 
# to the layers of the model.
clustered_model = keras.models.clone_model(
    base_model,
    clone_function=apply_clustering_to_dense,
)

clustered_model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 cluster_dense_3 (ClusterWe  (None, 20)                823       
 ights)                                                          
                                                                 
 flatten_3 (Flatten)         (None, 20)                0         
                                                                 
=================================================================
Total params: 823 (4.78 KB)
Trainable params: 423 (1.65 KB)
Non-trainable params: 400 (3.12 KB)
_________________________________________________________________

Cluster convolutional layers per channel

The clustered model could be passed to further optimizations such as a post training quantization. If the quantization is done per channel, then the model should be clustered per channel as well. This increases the accuracy of the clustered and quantized model.

To cluster per channel, the parameter cluster_per_channel should be set to True. It could be set for some layers or for the whole model.

Tips:

If a model is to be quantized further, you can consider to use cluster preserving QAT technique.
The model could be pruned before applying the clustering per channel. With the parameter preserve_sparsity is set to True, the sparsity is preserved during the clustering per channel. Note that the sparsity and cluster preserving QAT technique should be used in this case.

Cluster custom Keras layer or specify which weights of layer to cluster

tfmot.clustering.keras.ClusterableLayer serves two use cases:

Cluster any layer that is not supported natively, including a custom Keras layer.
Specify which weights of a supported layer are to be clustered.

For an example, the API defaults to only clustering the kernel of the Dense layer. The example below shows how to modify it to also cluster the bias. Note that when deriving from the keras layer, you need to override the function get_clusterable_weights, where you specify the name of the trainable variable to be clustered and the trainable variable itself. For example, if you return an empty list [], then no weights will be clusterable.

Common mistake: clustering the bias usually harms model accuracy too much.

class MyDenseLayer(keras.layers.Dense, tfmot.clustering.keras.ClusterableLayer):

  def get_clusterable_weights(self):
   # Cluster kernel and bias. This is just an example, clustering
   # bias usually hurts model accuracy.
   return [('kernel', self.kernel), ('bias', self.bias)]

# Use `cluster_weights` to make the `MyDenseLayer` layer train with clustering as usual.
model_for_clustering = keras.Sequential([
  tfmot.clustering.keras.cluster_weights(MyDenseLayer(20, input_shape=[input_dim]), **clustering_params),
  keras.layers.Flatten()
])

model_for_clustering.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 cluster_my_dense_layer (Cl  (None, 20)                846       
 usterWeights)                                                   
                                                                 
 flatten_4 (Flatten)         (None, 20)                0         
                                                                 
=================================================================
Total params: 846 (4.95 KB)
Trainable params: 426 (1.66 KB)
Non-trainable params: 420 (3.28 KB)
_________________________________________________________________

You may also use tfmot.clustering.keras.ClusterableLayer to cluster a keras custom layer. To do this, you extend keras.Layer as usual and implement the __init__, call, and build functions, but you also need to extend the clusterable_layer.ClusterableLayer class and implement:

get_clusterable_weights, where you specify the weight kernel to be clustered, as shown above.
get_clusterable_algorithm, where you specify the clustering algorithm for the weight tensor. This is because you need to specify how the custom layer weights are shaped for clustering. The returned clustering algorithm class should be derived from the clustering_algorithm.ClusteringAlgorithm class and the function get_pulling_indices should be overwritten. An example of this function, which supports weights of ranks 1D, 2D, and 3D, can be found here.

An example of this use case can be found here.

Checkpoint and deserialize a clustered model

Your use case: this code is only needed for the HDF5 model format (not HDF5 weights or other formats).

# Define the model.
base_model = setup_model()
base_model.load_weights(pretrained_weights)
clustered_model = cluster_weights(base_model, **clustering_params)

# Save or checkpoint the model.
_, keras_model_file = tempfile.mkstemp('.h5')
clustered_model.save(keras_model_file, include_optimizer=True)

# `cluster_scope` is needed for deserializing HDF5 models.
with tfmot.clustering.keras.cluster_scope():
  loaded_model = keras.models.load_model(keras_model_file)

loaded_model.summary()

WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
WARNING:tensorflow:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 cluster_dense_4 (ClusterWe  (None, 20)                823       
 ights)                                                          
                                                                 
 cluster_flatten_5 (Cluster  (None, 20)                0         
 Weights)                                                        
                                                                 
=================================================================
Total params: 823 (4.78 KB)
Trainable params: 423 (1.65 KB)
Non-trainable params: 400 (3.12 KB)
_________________________________________________________________
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tf_keras/src/engine/training.py:3098: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native TF-Keras format, e.g. `model.save('my_model.keras')`.
  saving_api.save_model(

Improve the accuracy of the clustered model

For your specific use case, there are tips you can consider:

Centroid initialization plays a key role in the final optimized model accuracy. In general, kmeans++ initialization outperforms linear, density and random initialization. When not using kmeans++, linear initialization tends to outperform density and random initialization, since it does not tend to miss large weights. However, density initialization has been observed to give better accuracy for the case of using very few clusters on weights with bimodal distributions.
Set a learning rate that is lower than the one used in training when fine-tuning the clustered model.
For general ideas to improve model accuracy, look for tips for your use case(s) under "Define a clustered model".

Deployment

Export model with size compression

Common mistake: both strip_clustering and applying a standard compression algorithm (e.g. via gzip) are necessary to see the compression benefits of clustering.

model = setup_model()
clustered_model = cluster_weights(model, **clustering_params)

clustered_model.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer='adam',
    metrics=['accuracy']
)

clustered_model.fit(
    x_train,
    y_train
)

final_model = tfmot.clustering.keras.strip_clustering(clustered_model)

print("final model")
final_model.summary()

print("\n")
print("Size of gzipped clustered model without stripping: %.2f bytes" 
      % (get_gzipped_model_size(clustered_model)))
print("Size of gzipped clustered model with stripping: %.2f bytes" 
      % (get_gzipped_model_size(final_model)))

1/1 [==============================] - 0s 448ms/step - loss: 1.8939 - accuracy: 0.0000e+00
final model
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_5 (Dense)             (None, 20)                420       
                                                                 
 flatten_6 (Flatten)         (None, 20)                0         
                                                                 
=================================================================
Total params: 420 (1.64 KB)
Trainable params: 420 (1.64 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Size of gzipped clustered model without stripping: 3527.00 bytes
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
Size of gzipped clustered model with stripping: 1488.00 bytes