sift1m

  • Description:

Pre-trained embeddings for approximate nearest neighbor search using the Euclidean distance. This dataset consists of two splits:

  1. 'database': 1,000,000 data points, each with the features 'embedding' (128 floats), 'index' (int64), and 'neighbors' (an empty list).
  2. 'test': 10,000 data points, each with the features 'embedding' (128 floats), 'index' (int64), and 'neighbors' (a list of the 'index' and 'distance' of the nearest neighbors in the database).
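
To make the 'neighbors' feature concrete, the following is a minimal sketch of exact (brute-force) Euclidean nearest neighbor search on toy data. The function name `exact_neighbors`, the 3-dimensional toy vectors, and the use of list position as the 'index' are illustrative assumptions; the real embeddings have 128 dimensions, and this page does not state whether the stored 'distance' values are plain or squared Euclidean distances.

```python
import math


def exact_neighbors(query, database, k):
    """Return the k database points closest to `query` (Euclidean distance).

    Each result mirrors the 'neighbors' entries: an 'index' and a 'distance'.
    Assumption: the position in `database` plays the role of 'index'.
    """
    scored = []
    for index, point in enumerate(database):
        # math.dist computes the plain (non-squared) Euclidean distance.
        scored.append({"index": index, "distance": math.dist(query, point)})
    # Sort by increasing distance and keep the k closest points.
    scored.sort(key=lambda neighbor: neighbor["distance"])
    return scored[:k]


# Toy data, not taken from the dataset.
database = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 4.0]]
query = [0.9, 0.0, 0.0]
neighbors = exact_neighbors(query, database, k=2)
```

An approximate nearest neighbor method would be scored by how closely its output matches a list like `neighbors`, computed exactly.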
  • Splits:
Split        Examples
'database'   1,000,000
'test'       10,000
  • Feature structure:
FeaturesDict({
    'embedding': Tensor(shape=(128,), dtype=float32),
    'index': Scalar(shape=(), dtype=int64),
    'neighbors': Sequence({
        'distance': Scalar(shape=(), dtype=float32),
        'index': Scalar(shape=(), dtype=int64),
    }),
})
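
As a rough sketch of how examples with this structure might be read, the snippet below loads a split through TensorFlow Datasets and summarizes one example. The helper names `iter_examples` and `describe` are hypothetical, and the code assumes that the `tensorflow_datasets` package is installed and that the dataset is registered under the name `sift1m`. It also relies on TFDS representing a `Sequence` of a dict as a dict of arrays.

```python
def iter_examples(split, limit=None):
    """Yield examples of a sift1m split as dicts of NumPy values."""
    # Imported lazily so the helpers below work without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load("sift1m", split=split)
    if limit is not None:
        ds = ds.take(limit)
    yield from tfds.as_numpy(ds)


def describe(example):
    """Summarize one example that follows the FeaturesDict layout above."""
    # 'neighbors' is a dict of parallel arrays: 'index' and 'distance'.
    neighbors = example["neighbors"]
    return {
        "index": int(example["index"]),
        "dim": len(example["embedding"]),
        "num_neighbors": len(neighbors["index"]),
    }


# Usage (downloads and prepares the dataset on first call):
# for example in iter_examples("test", limit=3):
#     print(describe(example))
```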
  • Feature documentation:
Feature             Class         Shape    Dtype    Description
                    FeaturesDict
embedding           Tensor        (128,)   float32
index               Scalar                 int64    Index within the split.
neighbors           Sequence                        The computed neighbors, which are only available for the test split.
neighbors/distance  Scalar                 float32  Neighbor distance.
neighbors/index     Scalar                 int64    Neighbor index.
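
One common use of the 'neighbors/index' values is to score an approximate search method by recall. The sketch below is an illustration, not part of the dataset's tooling: the function name `recall_at_k` is hypothetical, and it assumes that the stored neighbor list is ordered by increasing distance, which this page does not confirm.

```python
def recall_at_k(true_indices, predicted_indices, k):
    """Fraction of the k true nearest neighbors found in the top-k predictions.

    Assumption: both sequences are ordered from nearest to farthest.
    """
    truth = set(list(true_indices)[:k])
    found = set(list(predicted_indices)[:k])
    if not truth:
        # No ground truth available (e.g. the 'database' split).
        return 0.0
    return len(truth & found) / len(truth)


# Toy indices, not taken from the dataset.
true_indices = [3, 1, 4, 7]       # e.g. example["neighbors"]["index"]
predicted_indices = [1, 3, 9, 4]  # output of some approximate search
recall_top2 = recall_at_k(true_indices, predicted_indices, k=2)
recall_top4 = recall_at_k(true_indices, predicted_indices, k=4)
```

Comparing sets rather than ordered lists means that a prediction counts as correct whenever it appears anywhere in the true top k, regardless of rank.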
  • Citation:
@article{jegou2010product,
  title={Product quantization for nearest neighbor search},
  author={Jegou, Herve and Douze, Matthijs and Schmid, Cordelia},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={33},
  number={1},
  pages={117--128},
  year={2010},
  publisher={IEEE}
}