Distilling Retrieval Pipelines to a Single Embedding Model

Source: https://dtunkelang.medium.com/distilling-retrieval-pipelines-to-a-single-embedding-model-606f3ecf0c91
Author: Daniel Tunkelang

Summary

Daniel Tunkelang argues that multi-stage retrieval pipelines (BM25 → cross-encoder reranker) can be distilled into a single high-quality embedding model, achieving most of the pipeline’s quality at single-stage serving cost.

The Serving Cost Problem

A typical high-quality Retrieval Pipeline:

Query → BM25 (1ms) → top-1000 → Cross-encoder reranker (500ms) → top-10

The cross-encoder dominates latency. For production systems:

500ms is unacceptable for interactive search
Scaling cross-encoders is expensive: every (query, document) pair must be scored at query time, so nothing can be precomputed
Yet BM25-only retrieval misses ~15% of relevant documents
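
To make the cost structure concrete, here is a minimal sketch of a retrieve-then-rerank pipeline. The scoring functions are toy stand-ins (term overlap instead of BM25, a hypothetical `toy_pair_scorer` instead of a real cross-encoder), and all names are illustrative assumptions rather than anything from the original article. The point is only the shape: the second stage runs one scoring call per candidate at query time.

```python
from typing import Callable, List

def first_stage(query: str, corpus: List[str], k: int) -> List[int]:
    """Toy lexical retriever (stand-in for BM25): rank by number of shared terms."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), i) for i, doc in enumerate(corpus)]
    scored = [(s, i) for s, i in scored if s > 0]  # drop documents with no matching term
    scored.sort(key=lambda x: (-x[0], x[1]))       # best score first, index breaks ties
    return [i for _, i in scored[:k]]

def rerank(query: str, corpus: List[str], candidates: List[int],
           score_pair: Callable[[str, str], float], k: int) -> List[int]:
    """Score every (query, candidate) pair at query time, as a cross-encoder must."""
    scored = [(score_pair(query, corpus[i]), i) for i in candidates]
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [i for _, i in scored[:k]]

def toy_pair_scorer(query: str, doc: str) -> float:
    """Hypothetical cross-encoder stand-in: fraction of document tokens found in the query."""
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    return sum(t in q_terms for t in d_terms) / len(d_terms)

corpus = [
    "retrieval model survey with many additional words about unrelated topics",
    "embedding model for retrieval",
    "cooking pasta at home",
]
query = "retrieval model"
candidates = first_stage(query, corpus, k=1000)                # cheap, index-backed in practice
top = rerank(query, corpus, candidates, toy_pair_scorer, k=10)  # one scorer call per candidate
```

In this toy run the first stage returns documents 0 and 1 in index order (they tie on overlap), and the reranker promotes the shorter, more focused document 1. The reranking loop is the part whose cost grows with the candidate count.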

Knowledge Distillation as Solution

The pipeline itself generates high-quality training signal:

Step 1: Run BM25 + Cross-encoder on training queries
    → high-quality (query, relevant_doc) pairs with pipeline-quality labels

Step 2: Train a bi-encoder to match pipeline's relevance judgments
    → student learns to approximate the teacher pipeline

Step 3: Deploy student bi-encoder alone
    → single-stage, fast, approximates pipeline quality
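
The three steps can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the training recipe from the article: the teacher scores are made-up numbers, the student "encoders" are free embedding vectors rather than neural networks, the teacher scores are converted to soft labels with a softmax, and the loss is the KL divergence between teacher and student distributions (a common choice for distillation, assumed here).

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)

# Step 1 (assumed already run): pipeline scores for one training query's candidates.
teacher_scores = np.array([4.0, 2.5, 0.5, -1.0])  # hypothetical cross-encoder outputs
p_teacher = softmax(teacher_scores)               # soft relevance labels

# Step 2: train the student. Embeddings are free parameters in this toy version;
# a real bi-encoder would produce them from text.
dim = 8
q_vec = rng.normal(scale=0.1, size=dim)
doc_vecs = rng.normal(scale=0.1, size=(4, dim))

lr = 0.2
losses = []
for step in range(2000):
    p_student = softmax(doc_vecs @ q_vec)          # student scores are dot products
    losses.append(kl_divergence(p_teacher, p_student))
    grad_logits = p_student - p_teacher            # gradient of the KL w.r.t. student logits
    grad_q = doc_vecs.T @ grad_logits
    grad_docs = np.outer(grad_logits, q_vec)
    q_vec -= lr * grad_q
    doc_vecs -= lr * grad_docs

# Step 3: serve the student alone; ranking needs only dot products.
student_ranking = np.argsort(-(doc_vecs @ q_vec))
```

After training, the student's dot-product scores reproduce the teacher's preference for the first candidate without calling the teacher at all. In a deployed system the document vectors would be precomputed and indexed, which is where the latency saving comes from.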

The same approach was used to train ELSER: a teacher ensemble (MiniLM + MonoT5-3B) distilled into a compressed 100M-parameter student.

Bag-of-Documents as Training Signal

Tunkelang frames this using the Bag-of-Documents Model:

The pipeline estimates P(d | q) for training queries
The student bi-encoder learns to reproduce this distribution
The student’s embedding space effectively encodes the pipeline’s relevance judgments
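
One way to write this framing down, as a sketch: let $s_T(q, d)$ be the pipeline's (teacher's) score, $C(q)$ the candidate set for query $q$, and $E_q$, $E_d$ the student's query and document encoders. The softmax normalization, the temperature $\tau$, and the KL objective are assumptions chosen for illustration; the source states only that the student learns to reproduce $P(d \mid q)$.

```latex
P_T(d \mid q) = \frac{\exp\big(s_T(q, d)\big)}{\sum_{d' \in C(q)} \exp\big(s_T(q, d')\big)},
\qquad
P_S(d \mid q) = \frac{\exp\big(E_q(q) \cdot E_d(d) / \tau\big)}{\sum_{d' \in C(q)} \exp\big(E_q(q) \cdot E_d(d') / \tau\big)}

\mathcal{L} = \sum_{q} \mathrm{KL}\big(P_T(\cdot \mid q) \,\|\, P_S(\cdot \mid q)\big)
            = \sum_{q} \sum_{d \in C(q)} P_T(d \mid q) \, \log \frac{P_T(d \mid q)}{P_S(d \mid q)}
```

Minimizing $\mathcal{L}$ pushes the student's dot products to induce the same distribution over documents that the pipeline does, which is the sense in which the embedding space encodes the pipeline's judgments.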

Quality vs. Speed Tradeoff

System	Latency	NDCG@10	Notes
BM25 alone	1ms	0.228	Baseline
Dense bi-encoder	10ms	0.253	Trained generally
Distilled bi-encoder	10ms	0.263	Trained on pipeline output
Full pipeline	500ms	0.271	Gold standard

Going by the table, the distilled model recovers about 81% of the pipeline’s NDCG@10 improvement over BM25 (0.035 of 0.043) at 10ms latency, 50x lower than the full pipeline’s 500ms.
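
The share of the pipeline's gain that each system recovers can be recomputed directly from the table. The short calculation below uses only the table's values; the dictionary and function names are illustrative. With the values as listed, the distilled model's share works out to roughly 81% and the generally trained bi-encoder's to roughly 58%.

```python
# NDCG@10 and latency values copied from the table above.
ndcg = {"bm25": 0.228, "dense": 0.253, "distilled": 0.263, "pipeline": 0.271}
latency_ms = {"bm25": 1, "dense": 10, "distilled": 10, "pipeline": 500}

def gain_recovered(system: str) -> float:
    """Fraction of the pipeline's NDCG@10 gain over BM25 that `system` achieves."""
    return (ndcg[system] - ndcg["bm25"]) / (ndcg["pipeline"] - ndcg["bm25"])

distilled_share = gain_recovered("distilled")  # 0.035 / 0.043
dense_share = gain_recovered("dense")          # 0.025 / 0.043
speedup = latency_ms["pipeline"] / latency_ms["distilled"]

print(f"distilled: {distilled_share:.0%} of the gain, {speedup:.0f}x faster than the pipeline")
print(f"dense:     {dense_share:.0%} of the gain")
```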

Limitations

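Distribution shift: the distilled model is trained only on the pipeline’s output for the training queries, so it may not generalize to out-of-distribution queries
Pipeline quality ceiling: the student learns only from the teacher’s judgments, so it generally cannot exceed the teacher’s quality
Static: the distilled model must be retrained whenever the pipeline improves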

Relation to Constructing Query Vectors

Distillation can be seen as an offline form of Hypothetical Document Embeddings:

HyDE: at query time, generate what a relevant document looks like
Distillation: at training time, teach the model what relevant documents look like per query type

Related Articles

Agentic Search as an Agile Engineering Process — same author on retrieval evolution
Elastic Learned Sparse Encoder ELSER Retrieval Performance — distillation in production
SPLADE - Sparse Bi-Encoder BERT Model for First-Stage Ranking — SPLADE++ uses distillation

Related Concepts

Retrieval Pipeline — what is being distilled
Embedding Fine-tuning — distillation is a training technique
Bag-of-Documents Model — theoretical framing
Dense Vector Retrieval — deployment target
Cross-Encoder — teacher in the distillation
Bi-Encoder — student being trained

People

Daniel Tunkelang — author
