BM25
Definition
BM25 (Best Match 25) is the dominant lexical retrieval algorithm — the scoring function behind Elasticsearch’s default relevance, Solr, Lucene, and most production search engines. It is a probabilistic term-frequency weighting model that improves on TF-IDF by accounting for document length and term-frequency saturation.
Formula
BM25(d,q) = Σ_{t∈q} IDF(t) × (tf(t,d) × (k1+1)) / (tf(t,d) + k1×(1-b+b×|d|/avgdl))
Where:
tf(t,d)= term frequency of term t in document dIDF(t)= inverse document frequency = log((N-n+0.5)/(n+0.5)+1)|d|= document length in tokensavgdl= average document length in the corpusk1∈ [1.2, 2.0] = term frequency saturation parameterb∈ [0, 1] = document length normalization (typically 0.75)
Key Innovations Over TF-IDF
1. Term Frequency Saturation
In TF-IDF: score scales linearly with tf (10 occurrences = 10× weight of 1 occurrence).
In BM25: the k1 parameter saturates tf — going from 1 to 2 occurrences increases score a lot; going from 10 to 20 occurrences adds little.
Intuition: Mentioning “python” once signals relevance; mentioning it 50 times doesn’t mean 50× more relevant.
2. Document Length Normalization
The b parameter penalizes long documents. Without it, long documents would score higher simply because they contain more terms — even if they’re less dense with the topic.
b=0: no length normalization
b=1: full normalization
b=0.75: standard compromise
Parameters
BM25 parameters are tunable:
| Parameter | Default | Effect |
|---|---|---|
| k1=1.2 | Low | Term saturation happens quickly |
| k1=2.0 | High | More benefit from repeated terms |
| b=0.0 | — | Ignore document length |
| b=1.0 | — | Strong length normalization |
Elasticsearch defaults: k1=1.2, b=0.75. Tuning can improve NDCG by 2–5%.
BM25F
BM25F (BM25 with Fields) extends BM25 to handle multi-field documents (title, body, URL):
- Different field weights (title match more important than body match)
- Field-specific length normalization
Essential for e-commerce (product name vs. description) and enterprise search (email subject vs. body).
Bayesian BM25 (BB25)
A probabilistic extension by Doug Turnbull:
- Treats BM25 parameters as priors
- Updates based on collection statistics
- Particularly useful for out-of-vocabulary and low-frequency terms
BM25 vs. Semantic Search
| Dimension | BM25 | Semantic (Bi-Encoder) |
|---|---|---|
| Vocabulary mismatch | Fails | Handles |
| Exact term match | Excellent | Can miss |
| Speed | Very fast | Slower (ANN) |
| Interpretability | High | Low |
| OOV terms | Fails | Handles |
The standard combination: Hybrid Search (BM25 + bi-encoder).
Related Concepts
-
Embeddings — BM25 is the classical (non-neural) sparse retrieval baseline
-
Sparse Embeddings — BM25 as the foundational sparse representation
-
Hybrid Search — BM25 as the sparse component
-
Sparse Vector Retrieval — BM25 is the classical sparse retrieval baseline
-
SPLADE — learned sparse model that outperforms BM25
-
ELSER — +17% NDCG@10 over BM25
-
Retrieval Pipeline — BM25 often the fast first stage
-
NDCG — BM25 baseline NDCG@10 = 0.228 on MS MARCO
People
-
Doug Turnbull — BM25 in production, BB25 extension, practical tuning guides
-
Can BM25 be a Probability — Doug Turnbull; BB25 framework, calibration via gradient descent for hybrid search