Nearest Neighbor Indexes for Similarity Search
“There is no ‘one-size-fits-all’ in similarity search.” Index selection depends on balancing accuracy, speed, and memory.
Index types
Flat Index
Exhaustive comparison against every vector. Perfect recall, slow for large datasets.
- Use when: dataset <10K vectors or quality is paramount
- L2:
IndexFlatL2| IP:IndexFlatIP - No training required
LSH (Locality Sensitive Hashing)
Groups vectors into buckets using hash functions that maximize collisions. High sensitivity to dimensionality curse.
- Use when: low-dimensional vectors, small indexes
- Not suitable for 128+ dimensions or large datasets
- Parameter:
nbits(higher = better accuracy, more memory)
HNSW (Hierarchical Navigable Small World Graphs)
Multi-layered graph structure. Consistently top-performing for ANN search.
- Use when: larger, high-dimensional datasets with ample RAM
- Parameters:
M(connections per vertex),efSearch,efConstruction - High memory footprint (~1.6GB for Sift1M at M=128)
IVF (Inverted File Index)
Clusters vectors via Voronoi diagrams, restricts search to relevant cells.
- Use when: scalable balance of metrics needed
- Parameters:
nlist(cells/clusters),nprobe(cells to search) - Addresses edge problem with
nprobetuning - Requires training step
Benchmark results (Sift1M, 128d, M1 chip)
| Index | Memory | Query Time | Recall |
|---|---|---|---|
| Flat | ~500 MB | ~18 ms | 1.0 |
| LSH | 20–600 MB | 1.7–30 ms | 0.4–0.85 |
| HNSW | 600–1600 MB | 0.6–2.1 ms | 0.5–0.95 |
| IVF | ~520 MB | 1–9 ms | 0.7–0.95 |