Sparse Vector Retrieval

Definition

Sparse vector retrieval represents documents and queries as high-dimensional vectors in which only a small fraction of dimensions are non-zero. Each dimension corresponds to a vocabulary term (vocabularies typically hold 30,000 or more tokens), and each non-zero value represents the learned importance of that term.

Unlike traditional bag-of-words methods (BM25, TF-IDF), which derive weights from term counts and collection statistics, modern sparse retrieval learns these weights with neural models.
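To make the definition concrete, here is a minimal sketch of a sparse vector stored as a term-to-weight mapping, with relevance computed as a dot product over shared terms. The `SparseVector` alias, the `dot` helper, and all weights are illustrative assumptions, not output from any real model.

```python
from typing import Dict

# term -> weight; any term not listed is implicitly zero
SparseVector = Dict[str, float]


def dot(query: SparseVector, doc: SparseVector) -> float:
    """Dot product over the terms the two vectors share."""
    # Iterate over the smaller vector so cost depends only on non-zero entries.
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(w * large[term] for term, w in small.items() if term in large)


# Made-up weights for illustration only.
doc_vec = {"flat": 1.2, "tire": 1.9, "puncture": 0.7, "repair": 1.1}
query_vec = {"fix": 0.9, "flat": 1.0, "tire": 1.5}

score = dot(query_vec, doc_vec)  # only "flat" and "tire" contribute
print(score)
```

Only the non-zero entries are stored or touched, which is why a vector with tens of thousands of nominal dimensions stays cheap to score.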

How It Differs from Dense Retrieval

Aspect           | Sparse                          | Dense
Vector size      | 30,000–100,000 dims             | 384–4096 dims
Non-zero values  | ~100–300 per doc                | All dims non-zero
Interpretability | High (term weights visible)     | Low (opaque)
Lexical matching | Native                          | Requires fine-tuning
Storage          | Inverted index friendly         | Requires ANN index
Speed            | Very fast (hardware-optimized)  | Slower (approx. search)

Key Models

SPLADE

  • BERT MLM head → vocabulary logits → log(1 + ReLU(x)) → max pooling over input tokens
  • Learns term expansion: “jaguar” in car context → high weight for “car”, “vehicle”, “speed”
  • Created at NAVER LABS Europe by Stéphane Clinchant and Thibault Formal
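The pipeline in the first bullet above can be sketched in plain Python. This is a toy illustration under stated assumptions: the four-term vocabulary and the logits are invented, and `splade_pool` is a hypothetical helper name. A real model would produce one logit per vocabulary entry for every input token.

```python
import math
from typing import Dict, List


def splade_pool(token_logits: List[List[float]], vocab: List[str]) -> Dict[str, float]:
    """Apply log(1 + ReLU(x)) to each logit, then max-pool over token positions."""
    weights: Dict[str, float] = {}
    for j, term in enumerate(vocab):
        # ReLU clips negatives to zero; log1p dampens very large activations.
        w = max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        if w > 0.0:  # keep only non-zero dimensions, which is the sparse part
            weights[term] = w
    return weights


# Hypothetical vocabulary and made-up MLM logits for a 2-token input.
vocab = ["jaguar", "car", "cat", "banana"]
token_logits = [
    [3.0, 1.5, -0.5, -2.0],  # logits at token position 0
    [0.5, 2.0, -1.0, -3.0],  # logits at token position 1
]

sparse = splade_pool(token_logits, vocab)
print(sparse)  # "cat" and "banana" are dropped: their logits are never positive
```

Terms whose logits are negative at every position end up with weight zero, so the ReLU is what produces sparsity; "car" receives weight even if it never appears in the input, which is the expansion effect.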

ELSER

  • Elastic’s production SPLADE-based model
  • Reported 17% NDCG@10 improvement over BM25
  • Optimized for Elasticsearch inverted index

uniCOIL

  • Simpler: learns scalar weights only for terms already present in the text (no expansion)
  • Fast, but generally less effective than SPLADE

DeepImpact

  • Token-level importance prediction without expansion
  • Trades some recall for speed

Term Expansion

The key advantage of learned sparse models: they expand queries/documents beyond literal terms.

Example: Query “how to fix a flat tire” might expand to:

  • “puncture”, “wheel”, “replace”, “pressure”, “pump”, “sidewall”

This bridges the vocabulary mismatch gap without requiring dense embeddings.
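A small sketch of why expansion matters for matching, assuming a document that discusses a puncture without ever using the words "flat" or "fix". The expansion terms come from the example above; every weight is invented, and the `score` helper is hypothetical.

```python
from typing import Dict


def score(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Dot product over shared terms."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


# Document vector with made-up weights; it shares no literal term with the query.
doc = {"puncture": 1.4, "wheel": 0.9, "pump": 0.6}

literal_query = {"fix": 1.0, "flat": 1.2, "tire": 1.5}

# Hypothetical expansion weights a learned sparse model might assign.
expansion = {"puncture": 0.8, "wheel": 0.5, "replace": 0.4, "pressure": 0.3}
expanded_query = {**literal_query, **expansion}

literal_score = score(literal_query, doc)    # no overlap, so zero
expanded_score = score(expanded_query, doc)  # matches via "puncture" and "wheel"
print(literal_score, expanded_score)
```

The literal query cannot retrieve this document at all, while the expanded query scores it through the added terms, still using exact term matching.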

Storage and Retrieval

Sparse vectors map naturally to inverted indexes:

"car":     [doc1: 0.8, doc3: 0.4, doc7: 0.6]
"vehicle": [doc1: 0.6, doc2: 0.9, doc5: 0.3]

Retrieval walks the posting lists of the query terms and accumulates a score per document, benefiting from decades of IR optimization.
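A minimal sketch of scoring against posting lists like the ones above, using simple term-at-a-time accumulation. The `search` function and the query weights are illustrative assumptions; production engines typically add dynamic pruning (for example WAND or MaxScore) rather than visiting every posting.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Posting lists copied from the example above: term -> [(doc_id, weight), ...]
index: Dict[str, List[Tuple[str, float]]] = {
    "car": [("doc1", 0.8), ("doc3", 0.4), ("doc7", 0.6)],
    "vehicle": [("doc1", 0.6), ("doc2", 0.9), ("doc5", 0.3)],
}


def search(query: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Walk each query term's posting list and accumulate per-document scores."""
    scores: Dict[str, float] = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc_id so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical query weights.
results = search({"car": 1.0, "vehicle": 0.5})
print(results)
```

Only documents that appear in at least one of the query's posting lists are ever scored, so the work scales with posting-list length rather than collection size.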

Hybrid search typically combines sparse and dense retrieval:

  • Sparse (SPLADE/BM25): lexical precision, term matching
  • Dense (Bi-Encoder): semantic recall, paraphrase matching
  • Combined via RRF (Reciprocal Rank Fusion) or weighted linear combination
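Both fusion options in the last bullet can be sketched briefly. The function names, rankings, and the `alpha` parameter are illustrative assumptions; `k = 60` is the constant conventionally used with RRF, and the weighted sum assumes the two score sets have already been normalized to comparable scales.

```python
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) across lists, ranks starting at 1."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def weighted_sum(sparse: Dict[str, float], dense: Dict[str, float],
                 alpha: float = 0.5) -> Dict[str, float]:
    """alpha * sparse + (1 - alpha) * dense; a missing score counts as 0."""
    docs = set(sparse) | set(dense)
    return {d: alpha * sparse.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0) for d in docs}


# Hypothetical top-3 lists from a sparse and a dense retriever.
sparse_ranking = ["doc1", "doc7", "doc2"]
dense_ranking = ["doc2", "doc1", "doc9"]

fused = rrf([sparse_ranking, dense_ranking])
fused_order = sorted(fused, key=lambda d: (-fused[d], d))
print(fused_order)
```

RRF uses only rank positions, so it sidesteps the problem that sparse and dense scores live on different scales; the weighted sum gives finer control but depends on a sensible normalization and a tuned `alpha`.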

People