Understanding the BM25 Full Text Search Algorithm

Source: https://emschwartz.me/understanding-the-bm25-full-text-search-algorithm/ Author: Evan Schwartz

Summary

Clear explanation of BM25’s formula, parameters, and key design decisions. One of the most accessible introductions to BM25 mechanics.

Formula Components

BM25(q, d) = Σ IDF(qi) × (TF(qi, d) × (k1 + 1)) / (TF(qi, d) + k1 × (1 - b + b × |d|/avgdl))

1. IDF — Inverse Document Frequency

Rare terms get higher weight than common terms:

IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5))

2. TF — Term Frequency with Saturation

Diminishing returns on repeated terms, controlled by k1.

3. Length Normalization

Prevents longer documents from dominating, controlled by b.

Key Parameters

ParameterTypical ValueEffect
k11.2–2.0Controls TF saturation — how quickly repeated terms lose incremental value
b0.75Controls length normalization strength (0 = disabled)

Key Design Insight

BM25 drops probability-calculation terms that don’t affect document ordering, and crucially assumes most documents are not relevant (R=r=0). This makes it practical without pre-existing relevance data.

Important Limitation

BM25 scores cannot be compared across different collections or time periods — only within the same collection for the same query.