Knowledge Distillation

Definition

Knowledge distillation trains a small “student” model to mimic the outputs of a larger, more powerful “teacher” model. The student learns from the teacher’s soft probability distributions (not just hard labels), capturing nuance the teacher has learned.

In search, this is the primary method for building fast Bi-Encoder retrievers that approach the quality of slow Cross-Encoder rerankers.

The Core Problem It Solves

Cross-Encoder rerankers are highly accurate but slow — they can’t score millions of documents at query time. Bi-Encoder models are fast but less accurate. Distillation bridges the gap:

Cross-encoder (teacher)
  │  scores query-doc pairs
  ▼
Soft relevance scores (e.g., 0.87, 0.43, 0.12...)
  │  used as training signal
  ▼
Bi-encoder (student)
  │  learns to reproduce teacher's ranking
  ▼
Fast retriever with near-teacher quality

Why Soft Labels Beat Hard Labels

Training on binary labels (relevant=1, not relevant=0) loses information. A cross-encoder might score three documents 0.9, 0.7, 0.3 — all “relevant” but clearly ranked. Distillation preserves this gradient, giving the student richer signal per training example.

Distillation for Embedding Models

Margin MSE loss: minimize the difference between teacher’s score margins and student’s score margins across document pairs:

L = (score_teacher(q, d+) - score_teacher(q, d-)) 
  - (score_student(q, d+) - score_student(q, d-))

Used in SBERT, BGE, and most production-grade embedding models.

Distillation vs. Fine-tuning

Embedding Fine-tuningKnowledge Distillation
Training signalHuman labels / click dataTeacher model scores
Requires human annotationYes (or implicit feedback)No (teacher generates labels)
Quality ceilingLabel qualityTeacher quality
CostHuman labelingTeacher inference cost

Distillation can be combined with fine-tuning: fine-tune a cross-encoder on human labels first, then distill that teacher into a bi-encoder.