Can BM25 be a Probability?

Author: Doug Turnbull

Summary

BM25 scores are unbounded and not interpretable as probabilities. This creates a fundamental problem for hybrid search: embedding similarities can be modeled as probabilities and combined rigorously, but BM25 cannot. This article reviews Bayesian BM25 (BB25) — an attempt to convert BM25 scores into calibrated probabilities enabling principled hybrid fusion.

The Core Problem

BM25 models odds (a likelihood ratio), not probabilities:

rel_odds(t) = P(t|R=1) / P(t|R=0)

Other relevance factors — embedding similarity, CTR, page rank, recency — can be modeled as probabilities and combined:

  • AND: P(R | E) * P(R | L)
  • OR: prior-based combinations

BM25 doesn’t fit this framework, so practitioners resort to magic number boosts and Reciprocal Rank Fusion.

Bayesian BM25 (BB25)

BB25 transforms BM25 into a probability using Bayesian reasoning:

Prior

Default relevance assumption from TF and field length:

tf_prior = 0.2 + 0.7 * min(1, TERM_FREQ / 10)
norm_prior = 0.3 + 0.6 * (1 - min(bm25_norm_length, 0.5))
bb25_prior = clamp(0.7 * tf_prior + 0.3 * norm_prior, 0.1, 0.9)

Likelihood

BM25 score passed through sigmoid as evidence:

bb25_likelihood = sigmoid(ALPHA * (bm25_score - BETA))
  • BETA = median BM25 score (corpus-specific)
  • ALPHA = steepness of transition

Posterior (Bayes Theorem)

bb25_lp = bb25_prior * bb25_likelihood
bb25_marginal = bb25_lp + (1 - bb25_likelihood) * (1 - bb25_prior)
bb25_posterior = bb25_lp / bb25_marginal

Calibration

The real power: learning alpha/beta from labels via gradient descent. Optimizes against ground truth so P(R | L) and P(R | E) are on the same scale — enabling principled hybrid search.

Alternative Views

Alternative 1: Make other factors look like BM25 (Craswell et al.) — scale non-BM25 signals with log/saturation functions; use Elasticsearch rank feature query. Avoids converting BM25 to probability.

Alternative 2: Just learn the boosts — learn weights e and l directly via random search or Bayesian optimization. Less interpretable but often wins in practice.

Alternative 3: Scaled BM25 — Lucene’s BM25 already scales TF from 0-1; IDF can also be normalized by log(num_docs). Could perform gradient descent on k1/b directly.

People