Introduction to Matryoshka Embedding Models

Source: https://huggingface.co/blog/matryoshka
Publisher: Hugging Face

Summary

Hugging Face’s official introduction to Matryoshka Representation Learning (MRL) — the technique enabling dimension-flexible embeddings — covering the theory, training with sentence-transformers, and practical deployment.

Matryoshka Representation Learning (MRL)

MRL, introduced by Kusupati et al. (2022), trains embedding models so that a prefix of the embedding vector (its first k dimensions, for the sizes used in training) is itself a usable, semantically meaningful embedding.

Like Russian nesting dolls (matryoshka), the full embedding contains smaller embeddings within it:

[dim 1..8] ← 8-dim embedding
[dim 1..16] ← 16-dim embedding
[dim 1..32] ← 32-dim embedding
...
[dim 1..1536] ← full embedding

Each prefix can be used on its own for similarity search and ranking, with accuracy falling off gradually as the prefix gets shorter.
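As a minimal sketch of what "using a prefix" means in practice, the snippet below slices the leading components of each vector and re-normalizes them so cosine similarity still behaves as expected. The helper name truncate_embeddings is hypothetical, and the random matrix only stands in for real model output.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalize the result."""
    if not 1 <= dim <= embeddings.shape[1]:
        raise ValueError(f"dim must be in [1, {embeddings.shape[1]}], got {dim}")
    prefix = embeddings[:, :dim]
    norms = np.linalg.norm(prefix, axis=1, keepdims=True)
    # Guard against all-zero prefixes to avoid division by zero.
    return prefix / np.clip(norms, 1e-12, None)

# Random vectors stand in for embeddings produced by an MRL-trained model.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))
small = truncate_embeddings(full, 64)
print(small.shape)
```

Re-normalizing matters because a prefix of a unit-length vector is generally no longer unit length.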

Training with MatryoshkaLoss

Standard training: minimize the loss for a single embedding size (the full output vector).
MRL training: minimize a sum of losses computed on several truncated sizes of the same embedding simultaneously.

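from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Base encoder to fine-tune; its output embeddings have 768 dimensions.
model = SentenceTransformer("bert-base-uncased")

# The inner loss is the ordinary training objective.
inner_loss = MultipleNegativesRankingLoss(model)

# MatryoshkaLoss applies the inner loss to each truncated size and sums the results.
loss = MatryoshkaLoss(
    model,
    loss=inner_loss,
    matryoshka_dims=[64, 128, 256, 512, 768],  # train at all these dims
)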

During training, the loss is computed at each specified dimension, forcing the model to encode information in a hierarchical, prefix-compatible way.
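The summed objective can be illustrated with a small NumPy sketch. This is not the sentence-transformers implementation: it only computes the forward value (no gradients), the function names are hypothetical, and the inner loss is an assumed in-batch contrastive loss over scaled cosine similarities (scale 20 is an assumption chosen for illustration).

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def in_batch_contrastive_loss(anchors, positives, scale=20.0) -> float:
    """Cross-entropy over scaled cosine similarities; row i's positive is column i."""
    logits = scale * l2_normalize(anchors) @ l2_normalize(positives).T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_style_loss(anchors, positives, dims, weights=None) -> float:
    """Weighted sum of the inner loss evaluated on each embedding prefix."""
    weights = [1.0] * len(dims) if weights is None else weights
    return sum(
        w * in_batch_contrastive_loss(anchors[:, :d], positives[:, :d])
        for d, w in zip(dims, weights)
    )

# Random pairs stand in for (anchor, positive) embeddings from a model.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))
print(matryoshka_style_loss(anchors, positives, dims=[64, 128, 256, 512, 768]))
```

Because every term shares the same leading components, lowering the total pushes the model to make the short prefixes useful on their own rather than relying only on the full vector.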

Supported Dimensions

For nomic-embed-text-v1.5 (example), approximate quality relative to the full 768-dimensional embedding:

  • 768 → full quality
  • 512 → ~99.5% quality
  • 256 → ~99% quality
  • 128 → ~98% quality
  • 64 → ~96% quality

By these figures, the quality loss is small even at aggressive truncation: cutting from 768 to 64 dimensions (a 12x reduction) costs only about 4%.
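To set those truncation sizes next to their storage cost, here is a small back-of-the-envelope calculation. It assumes float32 storage (4 bytes per component) and a hypothetical corpus of one million vectors, and it ignores any index overhead; the function name index_size_mb is illustrative.

```python
BYTES_PER_FLOAT32 = 4

def index_size_mb(num_vectors: int, dim: int,
                  bytes_per_value: int = BYTES_PER_FLOAT32) -> float:
    """Raw size of the stored vectors in megabytes (10**6 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e6

# Hypothetical corpus of one million embeddings at each truncation size.
for dim in (768, 512, 256, 128, 64):
    print(f"{dim:>4} dims: {index_size_mb(1_000_000, dim):8.1f} MB")
```

Storage scales linearly with the number of dimensions kept, so the 64-dimension prefix needs one twelfth of the space of the full 768-dimension vector.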

Why It Works

Standard embeddings pack information across all dimensions uniformly. MRL forces the model to prioritize: put the most important semantic information in the earliest dimensions, and refine/add detail in later dimensions.

This is analogous to progressive JPEG encoding: the coarse structure of the image is sent first, and later scans add fine detail.

Use Cases

  1. Adaptive retrieval: fast first pass (truncated) → accurate rerank (full)
  2. Tiered serving: low-latency tier (small dims), high-quality tier (full dims)
  3. Cost reduction: store smaller embeddings when quality constraints allow
  4. Mobile/edge: deploy small embeddings for on-device search
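The adaptive retrieval pattern from the first use case can be sketched as a two-stage search in NumPy. The function name adaptive_search and its parameters (short_dim, shortlist_size, top_k) are hypothetical, and the random vectors only stand in for real Matryoshka embeddings; a production system would typically use a vector index for the first stage.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)

def adaptive_search(query, docs, short_dim=64, shortlist_size=50, top_k=5):
    """Two-stage search: shortlist on truncated vectors, rerank on full vectors."""
    shortlist_size = min(shortlist_size, len(docs))
    # Stage 1: cosine similarity on the first `short_dim` components only.
    coarse = normalize(docs[:, :short_dim]) @ normalize(query[:short_dim])
    shortlist = np.argpartition(-coarse, shortlist_size - 1)[:shortlist_size]
    # Stage 2: rescore only the shortlisted documents with the full vectors.
    fine = normalize(docs[shortlist]) @ normalize(query)
    order = np.argsort(-fine)[:top_k]
    return shortlist[order], fine[order]

# Random vectors stand in for document embeddings; the query is a noisy copy of doc 17.
rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 768))
query = docs[17] + 0.05 * rng.normal(size=768)
ids, scores = adaptive_search(query, docs)
print(ids, scores)
```

The first stage touches every document but only 64 components of each; the second stage uses all 768 components but only for the 50 shortlisted candidates.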

Models Supporting MRL

Model                     Full Dims   Min Dims   Notes
text-embedding-3-small    1536        8          OpenAI
text-embedding-3-large    3072        8          OpenAI
nomic-embed-text-v1.5     768         64         Open source
mxbai-embed-large-v1      1024        64         Open source