BM25

Definition

BM25 (Best Match 25) is the dominant lexical retrieval algorithm — the scoring function behind Elasticsearch’s default relevance, Solr, Lucene, and most production search engines. It is a probabilistic term-frequency weighting model that improves on TF-IDF by accounting for document length and term-frequency saturation.

Formula

BM25(d,q) = Σ_{t∈q} IDF(t) × (tf(t,d) × (k1+1)) / (tf(t,d) + k1×(1-b+b×|d|/avgdl))

Where:

  • tf(t,d) = term frequency of term t in document d
  • IDF(t) = inverse document frequency = log((N-n+0.5)/(n+0.5)+1)
  • |d| = document length in tokens
  • avgdl = average document length in the corpus
  • k1 ∈ [1.2, 2.0] = term frequency saturation parameter
  • b ∈ [0, 1] = document length normalization (typically 0.75)

Key Innovations Over TF-IDF

1. Term Frequency Saturation

In TF-IDF: score scales linearly with tf (10 occurrences = 10× weight of 1 occurrence).
In BM25: the k1 parameter saturates tf — going from 1 to 2 occurrences increases score a lot; going from 10 to 20 occurrences adds little.

Intuition: Mentioning “python” once signals relevance; mentioning it 50 times doesn’t mean 50× more relevant.

2. Document Length Normalization

The b parameter penalizes long documents. Without it, long documents would score higher simply because they contain more terms — even if they’re less dense with the topic.

b=0: no length normalization
b=1: full normalization
b=0.75: standard compromise

Parameters

BM25 parameters are tunable:

ParameterDefaultEffect
k1=1.2LowTerm saturation happens quickly
k1=2.0HighMore benefit from repeated terms
b=0.0Ignore document length
b=1.0Strong length normalization

Elasticsearch defaults: k1=1.2, b=0.75. Tuning can improve NDCG by 2–5%.

BM25F

BM25F (BM25 with Fields) extends BM25 to handle multi-field documents (title, body, URL):

  • Different field weights (title match more important than body match)
  • Field-specific length normalization

Essential for e-commerce (product name vs. description) and enterprise search (email subject vs. body).

Bayesian BM25 (BB25)

A probabilistic extension by Doug Turnbull:

  • Treats BM25 parameters as priors
  • Updates based on collection statistics
  • Particularly useful for out-of-vocabulary and low-frequency terms
DimensionBM25Semantic (Bi-Encoder)
Vocabulary mismatchFailsHandles
Exact term matchExcellentCan miss
SpeedVery fastSlower (ANN)
InterpretabilityHighLow
OOV termsFailsHandles

The standard combination: Hybrid Search (BM25 + bi-encoder).

People