Learned Sparse Retrieval

Definition

Learned Sparse Retrieval (LSR) is the family of neural retrieval methods that represent queries and documents as sparse vectors in vocabulary space — where weights are produced by a trained model rather than counted from term frequencies. Unlike BM25 or TF-IDF, LSR models learn which terms (including terms not present in the original text) should carry weight for a given input.

LSR differs from classical sparse retrieval in that its weights are learned by a neural network, and from dense retrieval in that its representations remain sparse and compatible with an inverted index.


Why “Learned” Matters

Classical BM25 assigns weights via hand-crafted formulas (TF, IDF, document length normalization). LSR replaces this with a trained transformer that:

  1. Expands vocabulary: a document about “heart attack” also gets weight on the synonym “myocardial infarction” and on related terms such as “cardiac arrest”
  2. Suppresses noise: stopwords and irrelevant terms are pushed to zero by regularization
  3. Captures context: the same word “bank” gets different term weights in financial vs. river contexts

This bridges the vocabulary mismatch gap that cripples BM25 without abandoning the inverted-index infrastructure that makes sparse retrieval fast.


Training Paradigms

Knowledge Distillation

Most modern LSR models are trained by distilling a stronger teacher (cross-encoder or dense retriever) into the sparse model:

  • Margin MSE loss: minimize difference between teacher and student score margins
  • KL-divergence: match teacher’s full score distribution over candidates
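
The two distillation losses above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming scores are ordinary floats; the function names (`margin_mse`, `kl_distillation`) are hypothetical, and a real training loop would compute the same quantities on batched tensors in a deep learning framework.

```python
import math

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list with one score per (query, positive, negative) triple.
    """
    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        total += ((sp - sn) - (tp - tn)) ** 2
    return total / len(student_pos)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distillation(student_scores, teacher_scores):
    """KL(teacher || student) over the candidate list of one query."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Only the margins matter for Margin MSE: both margins are 1.0 here, so the loss is 0.0.
print(margin_mse([2.0], [1.0], [5.0], [4.0]))
# A student that reverses the teacher's ordering gets a positive KL penalty.
print(kl_distillation([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

Margin MSE only constrains score differences, so the student is free to use a different score scale than the teacher. The KL variant instead compares normalized distributions over the whole candidate list.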

Regularization for Sparsity

Without explicit constraints, transformer MLM heads produce dense activation patterns. LSR training enforces sparsity via:

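  • FLOPS regularizer (used by SPLADE): penalizes a smooth estimate of the average number of floating-point operations needed to score a query against a document, computed from the squared mean activation of each vocabulary term over a batch. Terms that fire on nearly every input are penalized most, so it acts as learned stopword removal
  • L1 regularization: penalizes the sum of absolute weights, which pushes small weights toward zero (the count of non-zero dimensions itself is the L0 norm, which is not differentiable)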
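
To make the difference between the two penalties concrete, here is a small sketch in plain Python. It assumes the usual formulation of the FLOPS term as the sum over vocabulary dimensions of the squared batch-mean activation, and an L1 term averaged over the batch; the function names and the toy two-term vocabulary are illustrative only.

```python
def flops_regularizer(batch):
    """Sum over vocabulary dimensions of the squared mean activation.

    batch: list of equal-length, non-negative weight vectors (one per input).
    """
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dims))

def l1_regularizer(batch):
    """Average over the batch of the sum of absolute weights."""
    return sum(sum(abs(w) for w in vec) for vec in batch) / len(batch)

# Same total weight in both batches, distributed differently.
shared_term = [[1.0, 0.0], [1.0, 0.0]]     # every input activates term 0
distinct_terms = [[1.0, 0.0], [0.0, 1.0]]  # each input activates its own term

print(l1_regularizer(shared_term), l1_regularizer(distinct_terms))        # 1.0 1.0
print(flops_regularizer(shared_term), flops_regularizer(distinct_terms))  # 1.0 0.5
```

L1 cannot tell the two batches apart, while the FLOPS term charges more for the term that is active on every input. Such a term would produce a long posting list in the index, which is why the penalty behaves like stopword removal.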

Contrastive Learning with Hard Negatives

Effective LSR training mines hard negatives, that is, documents that look relevant on the surface (for example, because they share many terms with the query) but do not actually answer it. Training against them forces the model to learn fine-grained term discrimination.
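
A rough sketch of this idea follows, assuming hard negatives are taken as the highest-ranked candidates from some first-stage retriever that are not labeled relevant, and that the loss is a standard softmax contrastive loss. Both helper names and the document IDs are hypothetical.

```python
import math

def mine_hard_negatives(ranked_doc_ids, positive_ids, k=2):
    """Return the k highest-ranked candidates that are not labeled relevant."""
    return [d for d in ranked_doc_ids if d not in positive_ids][:k]

def contrastive_loss(pos_score, neg_scores):
    """Negative log-probability of the positive under a softmax over all candidates."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score

# First-stage ranking for one query; "d1" is the only labeled positive.
negatives = mine_hard_negatives(["d3", "d1", "d7", "d2"], {"d1"}, k=2)
print(negatives)  # ['d3', 'd7']

# Negatives scored close to the positive yield a larger loss than easy ones.
print(contrastive_loss(5.0, [4.5, 4.8]) > contrastive_loss(5.0, [0.1, 0.2]))  # True
```

Unlabeled top-ranked documents are sometimes relevant in reality, so pipelines often filter mined negatives with a teacher model before using them.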


Model Family

Model         | Origin                              | Key Feature
SPLADE        | NAVER LABS Europe                   | BERT MLM + Log-ReLU + MaxPool; term expansion
SPLADE++      | NAVER LABS Europe                   | Ensemble distillation, better sparsity/effectiveness tradeoff
SPLADE-v3     | NAVER LABS Europe                   | Updated checkpoints; Hugging Face release (2024)
ELSER         | Elastic                             | SPLADE-based; zero-shot, production-tuned for Elasticsearch
uniCOIL       | Castorini Lab                       | Scalar weights per existing token only; no expansion; fast
DeepImpact    | Mallia et al. (NYU, Stanford, Pisa) | Token-level impact scores; no learned expansion (documents are pre-expanded with DocT5Query)
Neural Sparse | OpenSearch/AWS                      | Open-source SPLADE-style model for OpenSearch

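SPLADE and ELSER are among the most widely used production-grade LSR models. uniCOIL and DeepImpact trade effectiveness for speed: because they do not learn to expand terms, their vectors are shorter and cheaper to score.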


Mechanism (SPLADE Canonical Example)

Input text
  → Transformer (BERT backbone)
  → MLM head → 30,522-dim vocabulary logits per token
  → Log(1 + ReLU(logits))      ← log-saturation activation
  → MaxPool over all tokens     ← aggregate across positions
  → Sparse vector (~100–300 non-zero dims)
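
The last three steps of the pipeline can be sketched on toy numbers. This is an illustration only: it assumes the per-token logits have already been produced by the MLM head, uses a four-term vocabulary instead of 30,522, and ignores padding masks; `splade_pool` is a hypothetical name.

```python
import math

def splade_pool(token_logits):
    """Apply log(1 + ReLU(x)) to every logit, then max-pool over token positions.

    token_logits: one list of vocabulary logits per input token.
    Returns a sparse vector as {vocabulary_index: weight} with zeros dropped.
    """
    vocab_size = len(token_logits[0])
    weights = [0.0] * vocab_size
    for logits in token_logits:
        for j, x in enumerate(logits):
            w = math.log1p(max(x, 0.0))  # log-saturation activation
            if w > weights[j]:           # max over positions
                weights[j] = w
    return {j: w for j, w in enumerate(weights) if w > 0.0}

# Two input tokens, four vocabulary terms.
logits = [
    [2.0, -1.0, 0.0, -3.0],
    [0.5, -2.0, 1.0, -0.5],
]
print(splade_pool(logits))  # keys 0 and 2 survive, with weights log(3) and log(2)
```

Negative logits are zeroed by the ReLU, so any vocabulary term that no token position scores above zero is absent from the vector. That is where the sparsity comes from once regularization has pushed most logits below zero.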

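The resulting vector is stored in a standard inverted index. At retrieval time, scoring is a dot product over the vocabulary dimensions that are non-zero in both the query and the document vector. In effect this is the same bag-of-words matching as BM25, with learned weights and learned expansion terms in place of the TF-IDF formula.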
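
As a sketch of how such vectors are served, the example below builds a tiny in-memory inverted index and scores a query by accumulating weight products over posting lists. The term weights are invented for illustration, and production engines add compression and dynamic pruning that are not shown here.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vector, k=10):
    """Dot-product scoring: only terms present in the query are visited."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical learned vectors: d1 only mentions "heart attack" in its text,
# but the model has also given weight to the expansion terms.
docs = {
    "d1": {"heart": 1.25, "attack": 1.0, "myocardial": 1.0, "infarction": 0.5},
    "d2": {"bank": 1.5, "river": 0.75},
}
index = build_inverted_index(docs)
print(search(index, {"myocardial": 1.0, "infarction": 1.0}))  # [('d1', 1.5)]
```

The query shares no surface terms with the text of d1, yet d1 is retrieved because its expansion terms have posting-list entries. Documents with no overlapping dimension, such as d2, are never touched.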


Comparison

Aspect                   | BM25                                | Learned Sparse (SPLADE) | Dense (Bi-Encoder)
Term weighting           | Counted (TF-IDF)                    | Learned (neural)        | N/A (continuous)
Term expansion           | Manual synonyms only                | Learned automatically   | Implicit in embedding
Interpretability         | High                                | High (vocabulary terms) | Low
Infrastructure           | Inverted index                      | Inverted index          | ANN index
Zero-shot generalization | Robust, but limited by vocab mismatch | Good                  | Good in domain; can degrade out of domain
Latency                  | ~1–5ms                              | ~5–20ms                 | ~5–20ms

LSR is the sparse leg of Hybrid Search, replacing or augmenting BM25:

Query → LSR model → sparse vector → inverted index → top-k sparse results
Query → Bi-Encoder → dense vector → ANN index     → top-k dense results
                                          ↓
                              Reciprocal Rank Fusion

LSR typically outperforms raw BM25 in the sparse leg, improving the overall hybrid pipeline quality.
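
The fusion step in the diagram can be sketched as follows. Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the result lists it appears in. The constant k = 60 used here is a commonly cited default and should be read as an assumption, as are the document IDs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_results = ["d1", "d2", "d3"]  # from the LSR / inverted-index leg
dense_results = ["d2", "d4", "d1"]   # from the bi-encoder / ANN leg
print(reciprocal_rank_fusion([sparse_results, dense_results]))
# ['d2', 'd1', 'd4', 'd3']
```

RRF uses only ranks, so the sparse and dense scores do not need to be on comparable scales. This is one reason it is a popular choice for hybrid pipelines.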


Benchmarks

LSR models are primarily evaluated on BEIR (Benchmarking IR) — 18 heterogeneous retrieval datasets:

  • SPLADE variants have reported state-of-the-art results among sparse-only models at the time of their publication
  • ELSER is reported to improve average NDCG@10 by about 17% over BM25 across 12 BEIR datasets
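
For readers unfamiliar with the metric, NDCG@10 can be computed as in the sketch below. It assumes linear gain (the relevance grade divided by a log2 rank discount) and takes the relevance grades of all judged documents for the query so the ideal ranking can be built; the function names are illustrative, and evaluation toolkits may differ in details such as the gain function.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: grades of the returned documents, in ranked order.
    all_relevances: grades of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Two relevant documents exist; the system returns them at ranks 1 and 3.
print(round(ndcg_at_k([1, 0, 1], [1, 1]), 4))  # 0.9197
```

Benchmark numbers such as those above are averages of this per-query value over all queries in a dataset, and then over datasets.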

People