Learned Sparse Retrieval

Definition

Learned Sparse Retrieval (LSR) is the family of neural retrieval methods that represent queries and documents as sparse vectors in vocabulary space — where weights are produced by a trained model rather than counted from term frequencies. Unlike BM25 or TF-IDF, LSR models learn which terms (including terms not present in the original text) should carry weight for a given input.

LSR differs from classical sparse retrieval in that its term weights are learned by a neural network, and from dense retrieval in that its representations stay sparse and compatible with an inverted index.

Why “Learned” Matters

Classical BM25 assigns weights via hand-crafted formulas (TF, IDF, document length normalization). LSR replaces this with a trained transformer that:

Expands vocabulary — a document about “heart attack” also gets weight on “myocardial infarction”, “cardiac arrest”
Suppresses noise — stopwords and irrelevant terms are pushed to zero by regularization
Captures context — the same word “bank” gets different term weights in financial vs. river contexts

This bridges the vocabulary mismatch gap that cripples BM25 without abandoning the inverted-index infrastructure that makes sparse retrieval fast.
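For contrast, the hand-crafted weighting that LSR replaces can be sketched in a few lines. This is a minimal illustration of one common BM25 variant (Lucene-style IDF); the function name, its parameters, and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for the example, not taken from a specific library.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Hand-crafted BM25 weight of one term in one document (illustrative variant)."""
    # IDF: rarer terms get larger weights; the 1.0 inside the log keeps it non-negative.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term-frequency saturation combined with document length normalization.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

# A term that never occurs in the document always gets weight zero,
# which is the vocabulary mismatch problem described above.
absent = bm25_term_weight(tf=0, doc_len=100, avg_doc_len=120, n_docs=10_000, doc_freq=50)
present = bm25_term_weight(tf=3, doc_len=100, avg_doc_len=120, n_docs=10_000, doc_freq=50)
print(absent, present)
```

Every quantity in this formula is counted from the corpus. An LSR model instead predicts the weight, and can assign a non-zero weight to a term with tf = 0.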

Training Paradigms

Knowledge Distillation

Most modern LSR models are trained by distilling a stronger teacher (cross-encoder or dense retriever) into the sparse model:

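Margin MSE loss: minimize the squared difference between the teacher's and the student's score margins, where a margin is the positive document's score minus the negative document's score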
KL-divergence: match teacher’s full score distribution over candidates
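The two distillation objectives above can be sketched in plain Python. This is a minimal illustration on lists of scores, assuming the teacher and student scores are already computed; the function names are hypothetical, and a real training loop would use a tensor library with automatic differentiation.

```python
import math

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher (positive - negative) margins."""
    errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(errors) / len(errors)

def softmax(scores):
    """Numerically stable softmax over one candidate list."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation(student_scores, teacher_scores):
    """KL(teacher || student) between score distributions for one query."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# The student matches the teacher's margin (1.0) even though its raw scores differ.
print(margin_mse([2.0], [1.0], [5.0], [4.0]))
# The student ranks the candidates in the opposite order, so the divergence is positive.
print(kl_distillation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Note that Margin MSE only constrains score differences, so the student is free to use a different score scale from the teacher.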

Regularization for Sparsity

Without explicit constraints, transformer MLM heads produce dense activation patterns. LSR training enforces sparsity via:

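FLOPS regularizer (used by SPLADE): penalizes a smooth estimate of the floating-point operations needed at retrieval time. It falls hardest on terms that are active across many documents, so it acts as learned stopword removal
L1 regularization: penalizes the sum of absolute term weights, which pushes many dimensions to exactly zero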
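The difference between the two regularizers is easiest to see on a toy batch. The sketch below assumes the FLOPS penalty takes the form of a sum, over vocabulary dimensions, of the squared mean activation across the batch; the function names and the toy vectors are illustrative.

```python
def flops_regularizer(batch):
    """Sum over vocabulary dims of the squared mean activation across the batch."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dims))

def l1_regularizer(batch):
    """Mean over the batch of the summed absolute weights."""
    return sum(sum(abs(w) for w in vec) for vec in batch) / len(batch)

# Two toy batches with the same total weight, hence the same L1 penalty.
concentrated = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]  # both docs activate dim 0
spread = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]        # docs activate different dims

print(l1_regularizer(concentrated), l1_regularizer(spread))
print(flops_regularizer(concentrated), flops_regularizer(spread))
```

L1 cannot tell the two batches apart, while the FLOPS penalty is twice as large when every document activates the same term. Such a term would produce a long posting list, which is exactly what makes retrieval slow.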

Contrastive Learning with Hard Negatives

Effective LSR training mines hard negatives — documents that are superficially relevant but not truly relevant — to force the model to learn fine-grained term discrimination.
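A minimal sketch of why hard negatives matter, assuming an InfoNCE-style contrastive loss and a first-stage ranking to mine from. Both function names and the document ids are hypothetical.

```python
import math

def contrastive_loss(pos_score, neg_scores):
    """Negative log-probability of the positive among all candidates (InfoNCE-style)."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    # log-sum-exp computed stably by factoring out the maximum score.
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score

def mine_hard_negatives(ranked_doc_ids, relevant_ids, n):
    """Take the n highest-ranked documents that are not labelled relevant."""
    return [d for d in ranked_doc_ids if d not in relevant_ids][:n]

# Toy first-stage ranking for one query; only "d1" is labelled relevant.
print(mine_hard_negatives(["d3", "d1", "d7", "d2"], {"d1"}, n=2))

# A negative that scores close to the positive yields a much larger loss
# (and therefore a stronger training signal) than an easy one.
print(contrastive_loss(5.0, [4.5]), contrastive_loss(5.0, [0.0]))
```

One caveat with this mining strategy: highly ranked unlabelled documents are sometimes relevant but unjudged, so the mined negatives are typically filtered or softened, for example with teacher scores.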

Model Family

Model	Origin	Key Feature
SPLADE	NAVER LABS Europe	BERT MLM + Log-ReLU + MaxPool; term expansion
SPLADE++	NAVER LABS Europe	Ensemble distillation, better sparsity/effectiveness tradeoff
SPLADE-v3	NAVER LABS Europe	Updated checkpoints; Hugging Face release (2024)
ELSER	Elastic	SPLADE-based; zero-shot, production-tuned for Elasticsearch
uniCOIL	Castorini Lab	Scalar weights per existing token only; no expansion; fast
DeepImpact	NYU / Stanford / University of Pisa	Token-level impact scores for terms in the document, which is first expanded with DocT5Query; no MLM-based expansion
Neural Sparse	OpenSearch/AWS	Open-source SPLADE-style model for OpenSearch

SPLADE and ELSER are the dominant production-grade LSR models. uniCOIL and DeepImpact trade effectiveness for speed.

Mechanism (SPLADE Canonical Example)

Input text
  → Transformer (BERT backbone)
  → MLM head → 30,522-dim vocabulary logits per token
  → Log(1 + ReLU(logits))      ← log-saturation activation
  → MaxPool over all tokens     ← aggregate across positions
  → Sparse vector (~100–300 non-zero dims)
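The last three steps of the pipeline can be sketched on toy numbers. The example assumes the MLM logits are already available as a list of rows (one row per input token); the 5-term vocabulary, the logit values, and the function names are made up for illustration.

```python
import math

def splade_pool(token_logits):
    """Apply log(1 + ReLU(x)) to each logit, then max-pool over token positions."""
    vocab_size = len(token_logits[0])
    activated = [[math.log1p(max(0.0, x)) for x in row] for row in token_logits]
    return [max(row[j] for row in activated) for j in range(vocab_size)]

def to_sparse(vector, vocab):
    """Keep only the non-zero dimensions, keyed by vocabulary term."""
    return {vocab[j]: w for j, w in enumerate(vector) if w > 0.0}

# Toy vocabulary; a real BERT vocabulary has 30,522 entries.
vocab = ["heart", "attack", "cardiac", "the", "river"]
# Hypothetical MLM logits for a 2-token input: rows are tokens, columns are terms.
token_logits = [
    [3.0, 0.5, 1.0, -2.0, -4.0],
    [0.2, 2.5, 0.8, -1.0, -3.0],
]
doc_vector = to_sparse(splade_pool(token_logits), vocab)
print(doc_vector)
```

Negative logits are zeroed by the ReLU, so "the" and "river" drop out of the vector entirely, while "cardiac" receives weight even though it could be a term that does not appear in the input.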

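The resulting vector is stored in a standard inverted index. At retrieval time, the score is the dot product of the query and document vectors, computed only over the vocabulary dimensions they share. The index is traversed as it would be for BM25, but the weights in the posting lists are learned rather than counted.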
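Scoring over an inverted index can be sketched as follows. This is a toy in-memory index, assuming sparse vectors are dictionaries from term to weight; production engines add compression and dynamic pruning that are not shown here, and all names and weights are illustrative.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vector, k=10):
    """Score documents by dot product, reading only the query terms' posting lists."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Toy learned sparse vectors; "cardiac" is an expansion term on d1.
docs = {
    "d1": {"heart": 1.4, "attack": 1.2, "cardiac": 0.7},
    "d2": {"river": 1.5, "bank": 1.1},
}
query = {"cardiac": 1.0, "infarction": 0.8}

index = build_inverted_index(docs)
results = search(index, query)
print(results)
```

Only d1 is scored, because d2 shares no term with the query. Documents outside the query's posting lists are never touched, which is where the speed of sparse retrieval comes from.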

Comparison

	BM25	Learned Sparse (SPLADE)	Dense (Bi-Encoder)
Term weighting	Counted (TF-IDF)	Learned (neural)	N/A (continuous)
Term expansion	Manual synonyms only	Learned automatically	Implicit in embedding
Interpretability	High	High (vocabulary terms)	Low
Infrastructure	Inverted index	Inverted index	ANN index
Zero-shot generalization	Robust baseline, but limited by vocab mismatch	Good	Varies; often weaker out of domain
Latency (indicative; depends on hardware, index size, and query length)	~1–5 ms	~5–20 ms	~5–20 ms

Role in Hybrid Search

LSR is the sparse leg of Hybrid Search, replacing or augmenting BM25:

Query → LSR model → sparse vector → inverted index → top-k sparse results
Query → Bi-Encoder → dense vector → ANN index     → top-k dense results
                                          ↓
                              Reciprocal Rank Fusion

LSR typically outperforms raw BM25 in the sparse leg, improving the overall hybrid pipeline quality.
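The fusion step in the diagram can be sketched as below. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the lists it appears in; k = 60 is a commonly used default and is an assumption here, as are the function name and the toy result lists.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

sparse_hits = ["d1", "d2", "d3"]  # top-k from the LSR leg
dense_hits = ["d3", "d1", "d4"]   # top-k from the bi-encoder leg

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

RRF uses only ranks, so the dot-product scores of the sparse leg and the similarity scores of the dense leg never need to be normalized against each other.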

Benchmarks

LSR models are primarily evaluated on BEIR (Benchmarking IR) — 18 heterogeneous retrieval datasets:

SPLADE variants have reported state-of-the-art effectiveness among sparse-only models at the time of their publication
Elastic reports that ELSER improves average NDCG@10 by 17% over BM25 across 12 BEIR datasets
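For readers unfamiliar with the metric, NDCG@10 can be computed as in the sketch below. It assumes linear gain and a log2 rank discount; some implementations use an exponential gain (2^rel - 1) instead, so numbers are only comparable when the same variant is used. The function names and relevance labels are illustrative.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# One query with a single relevant document among three judged documents.
judged = [1, 0, 0]
print(ndcg_at_k([1, 0, 0], judged))  # relevant document ranked first
print(ndcg_at_k([0, 1, 0], judged))  # relevant document ranked second
```

A benchmark score is the mean of this per-query value over all queries in a dataset, and a BEIR average is then taken across datasets.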

Related Concepts

Sparse Vector Retrieval — the retrieval mechanism LSR powers
Sparse Embeddings — the representation type LSR produces
SPLADE — primary LSR model
ELSER — Elastic’s production LSR model
BM25 — classical sparse baseline that LSR improves upon
Dense Vector Retrieval — complementary approach; combined in Hybrid Search
Cross-Encoder — often used as teacher model for LSR distillation
Hybrid Search — LSR as the sparse leg

People

Thibault Formal — SPLADE co-inventor, NAVER LABS Europe
Stéphane Clinchant — SPLADE co-inventor, NAVER LABS Europe
Thomas Veasey — ELSER, Elastic
