K-Means partitioning tree configuration.

In ScaNN, we use a single-layer K-Means tree to partition the database (index) as a way to reduce the search space.
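The partitioning idea can be illustrated with a small pure-Python sketch. It is not ScaNN's implementation: the function names are hypothetical, the centroids are assumed to be given rather than trained, and squared L2 distance is assumed as the metric. It only shows how scanning the leaves closest to the query avoids scanning the whole database.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_partitions(database, centroids):
    """Group database indices by nearest centroid (one leaf per centroid)."""
    partitions = [[] for _ in centroids]
    for idx, vec in enumerate(database):
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_l2(vec, centroids[c]))
        partitions[nearest].append(idx)
    return partitions


def partitioned_search(query, database, centroids, partitions,
                       num_leaves_to_search, k=1):
    """Return the indices of the k nearest vectors found in the closest leaves."""
    # Step 1: rank all leaves by the distance from the query to their centroid.
    ranked = sorted(range(len(centroids)),
                    key=lambda c: squared_l2(query, centroids[c]))
    # Step 2: scan only the vectors stored in the closest leaves.
    candidates = [i for leaf in ranked[:num_leaves_to_search]
                  for i in partitions[leaf]]
    candidates.sort(key=lambda i: squared_l2(query, database[i]))
    return candidates[:k]


# Toy data: five 2-D vectors and three hand-picked centroids (illustrative only).
database = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
centroids = [(0.0, 0.1), (5.1, 5.0), (10.0, 0.0)]
partitions = assign_to_partitions(database, centroids)
nearest = partitioned_search((4.8, 5.1), database, centroids, partitions,
                             num_leaves_to_search=1, k=1)
```

With one leaf searched, only two of the five vectors are compared against the query. Searching more leaves widens the candidate set at the cost of more distance computations.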

num_leaves Number of leaves (partitions) in the K-Means tree. In general, a good starting point is the square root of the database size.
num_leaves_to_search At query time, ScaNN compares the query vector against all partition centroids and searches only the num_leaves_to_search closest partitions. Searching more leaves improves retrieval quality but increases computational cost.
training_sample_size Number of database embeddings to sample for K-Means training. The sample should be large enough to be representative of the database, but a larger sample also lengthens training. A good starting value is 100k, or the whole dataset if it is smaller than that.
min_partition_size Smallest allowable cluster size. Any cluster smaller than this is removed, and its data points are merged into other clusters. The recommended value is 1/10 of the average cluster size (the database size divided by num_leaves).
training_iterations Number of iterations to train K-Means for.
spherical If true, L2-normalize the K-Means centroids.
quantize_centroids If true, quantize centroids to int8.
random_init If true, use random initialization. Otherwise, use K-Means++.
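The starting-point recommendations above reduce to simple arithmetic. The helper below is a hypothetical convenience function, not part of any library; the rounding choices and the lower bound of 1 are assumptions made so that the results are valid integers.

```python
import math


def suggest_starting_values(database_size):
    """Derive starting values from the rules of thumb in the parameter list."""
    if database_size < 1:
        raise ValueError("database_size must be at least 1")
    # num_leaves: roughly the square root of the database size.
    num_leaves = max(1, round(math.sqrt(database_size)))
    # training_sample_size: 100k, or the whole dataset if it is smaller.
    training_sample_size = min(database_size, 100_000)
    # min_partition_size: 1/10 of the average cluster size.
    average_cluster_size = database_size / num_leaves
    min_partition_size = max(1, int(average_cluster_size / 10))
    return {
        "num_leaves": num_leaves,
        "training_sample_size": training_sample_size,
        "min_partition_size": min_partition_size,
    }


large = suggest_starting_values(1_000_000)
small = suggest_starting_values(10_000)
```

For a database of 1,000,000 embeddings this gives 1000 leaves, a training sample of 100,000, and a minimum partition size of 100. These are starting points to tune from, not guaranteed optima.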



Default values:

min_partition_size 50
quantize_centroids False
random_init True
spherical False
training_iterations 12
training_sample_size 100000
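As a rough illustration of how these values fit together, the sketch below mirrors them in a standalone dataclass. The class name `TreeConfig` is hypothetical and is not the library's class. Treating `num_leaves` and `num_leaves_to_search` as required (they have no value in the list above) and the sanity checks in `__post_init__` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TreeConfig:  # hypothetical stand-in, not the library's own class
    # No default is listed for these two, so they are assumed to be required.
    num_leaves: int
    num_leaves_to_search: int
    # Values copied from the list above.
    training_sample_size: int = 100000
    min_partition_size: int = 50
    training_iterations: int = 12
    spherical: bool = False
    quantize_centroids: bool = False
    random_init: bool = True

    def __post_init__(self):
        # Assumed sanity checks: at least one leaf, and no more leaves
        # searched than exist.
        if self.num_leaves < 1:
            raise ValueError("num_leaves must be at least 1")
        if not 1 <= self.num_leaves_to_search <= self.num_leaves:
            raise ValueError("num_leaves_to_search must be in [1, num_leaves]")


# num_leaves_to_search=10 is an arbitrary illustrative value.
config = TreeConfig(num_leaves=1000, num_leaves_to_search=10)
```

Only the two partition counts need to be chosen explicitly in this sketch; every other field falls back to the value listed above unless overridden.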