Knowledge Distillation

Definition

Knowledge distillation trains a small “student” model to mimic the outputs of a larger, more powerful “teacher” model. The student learns from the teacher’s soft probability distributions (not just hard labels), capturing nuance the teacher has learned.

In search, distillation is a widely used method for building fast Bi-Encoder retrievers that approach the quality of slower Cross-Encoder rerankers.

The Core Problem It Solves

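Cross-Encoder rerankers are highly accurate but slow: each query-document pair needs its own forward pass, so they can’t score millions of documents at query time. Bi-Encoder models are fast because document embeddings are computed ahead of time, but they are less accurate. Distillation bridges the gap: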

Cross-encoder (teacher)
  │  scores query-doc pairs
  ▼
Soft relevance scores (e.g., 0.87, 0.43, 0.12...)
  │  used as training signal
  ▼
Bi-encoder (student)
  │  learns to reproduce teacher's ranking
  ▼
Fast retriever with near-teacher quality
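
The cost asymmetry behind this diagram can be sketched in a few lines. This is a toy illustration, not a real model: the vectors, the documents, and the word-overlap `toy_pair_scorer` are all hypothetical stand-ins. It only shows that the bi-encoder path reuses precomputed document vectors, while the cross-encoder path makes one scorer call per query-document pair.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def bi_encoder_rank(query_vec: Vector, doc_vecs: List[Vector]) -> List[int]:
    """Rank documents by dot product against precomputed document vectors."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def cross_encoder_rank(
    query: str, docs: List[str], score_pair: Callable[[str, str], float]
) -> List[int]:
    """Rank documents by calling the pair scorer once per (query, doc)."""
    scores = [score_pair(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Hypothetical precomputed document embeddings (built offline, once).
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [0.9, 0.1]  # the query is encoded once at search time
bi_ranking = bi_encoder_rank(query_vec, doc_vecs)

# Hypothetical stand-in for a cross-encoder: counts shared words and
# records every call so the per-pair cost is visible.
pair_calls = []


def toy_pair_scorer(query: str, doc: str) -> float:
    pair_calls.append((query, doc))
    return float(len(set(query.split()) & set(doc.split())))


docs = ["distillation trains a student model", "student model", "cooking pasta"]
cross_ranking = cross_encoder_rank("student model distillation", docs, toy_pair_scorer)

print(bi_ranking, cross_ranking, len(pair_calls))
```

With millions of documents, the number of pair-scorer calls grows with the collection size, whereas the bi-encoder only needs one query encoding plus a vector search.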

Why Soft Labels Beat Hard Labels

Training on binary labels (relevant=1, not relevant=0) loses information. A cross-encoder might score three documents 0.9, 0.7, 0.3: all three may be labeled “relevant”, but the teacher clearly ranks them. Distillation preserves this graded ordering, giving the student a richer signal per training example.
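
A small sketch can make the difference concrete. It assumes one common formulation of soft-label distillation (temperature-scaled softmax over the teacher's scores, compared to the student's with KL divergence); the temperature value and the two hypothetical students are illustrative choices, not taken from any particular system. One student orders the three documents like the teacher, the other reverses them.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same documents."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def binary_cross_entropy(labels: List[int], scores: List[float]) -> float:
    """Mean BCE with a sigmoid over raw scores (the hard-label view)."""
    total = 0.0
    for y, s in zip(labels, scores):
        prob = 1.0 / (1.0 + math.exp(-s))
        total -= y * math.log(prob) + (1 - y) * math.log(1.0 - prob)
    return total / len(labels)


teacher_scores = [0.9, 0.7, 0.3]      # scores from the example above
hard_labels = [1, 1, 1]               # all three count as "relevant"
student_same_order = [0.85, 0.6, 0.35]
student_reversed = [0.35, 0.6, 0.85]  # same scores, opposite ranking

T = 0.1  # assumed temperature, chosen only for illustration
teacher_dist = softmax(teacher_scores, T)
kl_same_order = kl_divergence(teacher_dist, softmax(student_same_order, T))
kl_reversed = kl_divergence(teacher_dist, softmax(student_reversed, T))

bce_same_order = binary_cross_entropy(hard_labels, student_same_order)
bce_reversed = binary_cross_entropy(hard_labels, student_reversed)

print(f"soft-label KL: same order {kl_same_order:.4f}, reversed {kl_reversed:.4f}")
print(f"hard-label BCE: same order {bce_same_order:.4f}, reversed {bce_reversed:.4f}")
```

The soft-label loss penalizes the reversed student far more heavily, while the hard-label loss cannot tell the two students apart because every document carries the same label.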

Distillation for Embedding Models

Margin MSE loss: minimize the squared difference between the teacher’s score margin and the student’s score margin for each (query, positive, negative) triple, averaged over the training triples:

L = ( (score_teacher(q, d+) - score_teacher(q, d-))
    - (score_student(q, d+) - score_student(q, d-)) )^2

Teacher-score distillation of this kind is used in the SBERT and BGE model families and in many production-grade embedding models.
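
As a minimal sketch, Margin MSE can be computed in plain Python as the mean squared difference between teacher and student margins. The function name `margin_mse` and the score values are hypothetical; a real training loop would compute the same quantity on tensors inside a deep learning framework.

```python
import math
from typing import List, Tuple

# Each entry is (score(q, d+), score(q, d-)) for one training triple.
ScorePair = Tuple[float, float]


def margin_mse(teacher: List[ScorePair], student: List[ScorePair]) -> float:
    """Mean squared difference between teacher and student score margins."""
    if not teacher or len(teacher) != len(student):
        raise ValueError("need equally sized, non-empty score lists")
    squared_errors = [
        ((t_pos - t_neg) - (s_pos - s_neg)) ** 2
        for (t_pos, t_neg), (s_pos, s_neg) in zip(teacher, student)
    ]
    return sum(squared_errors) / len(squared_errors)


teacher_scores = [(0.9, 0.3), (0.7, 0.1)]  # teacher margins: 0.6, 0.6
student_scores = [(0.8, 0.4), (0.5, 0.2)]  # student margins: 0.4, 0.3
loss = margin_mse(teacher_scores, student_scores)

# A student whose scores are shifted by a constant keeps the same margins,
# so its loss is (numerically) zero even though the raw scores differ.
shifted_student = [(p + 5.0, n + 5.0) for p, n in teacher_scores]
shifted_loss = margin_mse(teacher_scores, shifted_student)

print(loss, shifted_loss)
```

Because only margins are compared, the student is free to use a different score scale than the teacher, which matters when the student produces dot products and the teacher produces classifier logits.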

Distillation vs. Fine-tuning

Aspect	Embedding Fine-tuning	Knowledge Distillation
Training signal	Human labels / click data	Teacher model scores
Requires human annotation	Yes (or implicit feedback)	No (teacher generates labels)
Quality ceiling	Label quality	Teacher quality
Cost	Human labeling	Teacher inference cost

Distillation can be combined with fine-tuning: fine-tune a cross-encoder on human labels first, then distill that teacher into a bi-encoder.
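
The second stage of that combination, labeling training triples with the teacher, might look like the sketch below. Everything here is an assumption made for illustration: `DistillExample`, `build_distillation_set`, and the word-overlap `toy_teacher_score` are hypothetical, and the toy scorer merely stands in for a cross-encoder that was already fine-tuned on human labels.

```python
from typing import Callable, List, NamedTuple, Tuple


class DistillExample(NamedTuple):
    query: str
    positive: str
    negative: str
    teacher_margin: float  # target the student is trained to reproduce


def build_distillation_set(
    triples: List[Tuple[str, str, str]],
    teacher_score: Callable[[str, str], float],
) -> List[DistillExample]:
    """Label each (query, positive, negative) triple with the teacher's margin."""
    examples = []
    for query, positive, negative in triples:
        margin = teacher_score(query, positive) - teacher_score(query, negative)
        examples.append(DistillExample(query, positive, negative, margin))
    return examples


def toy_teacher_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a fine-tuned cross-encoder: query-word overlap."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)


triples = [
    ("knowledge distillation", "knowledge distillation trains a student", "pasta recipes"),
    ("bi-encoder retrieval", "fast bi-encoder retrieval", "slow bi-encoder training"),
]
examples = build_distillation_set(triples, toy_teacher_score)

for ex in examples:
    print(ex.query, round(ex.teacher_margin, 2))
```

The resulting margins are the kind of target a Margin MSE loss consumes; the teacher runs once offline per pair, which is the "teacher inference cost" listed in the table.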

Related Concepts

Embedding Fine-tuning: related training approach
Bi-Encoder: primary distillation target (student)
Cross-Encoder: primary distillation source (teacher)
Reranking: the task cross-encoders excel at
Dense Embeddings: output of distilled bi-encoders
BERT: backbone architecture for both teacher and student
