Flavors of NDCG — Normalized to What!?

Source: https://softwaredoug.com/blog/2024/05/22/flavors-of-ndcg
Author: Doug Turnbull

Summary

Doug Turnbull clarifies a persistent source of confusion: NDCG has multiple valid mathematical formulations that produce different numbers from the same input. Not knowing which “flavor” a benchmark uses can lead to apples-to-oranges comparisons.

The Two Main Variants

Variant 1: Jarvelin & Kekalainen (2002) — “Grade Gain”

DCG@k = rel₁ + Σᵢ₌₂ᵏ rel_i / log₂(i)

Position 1 gets full relevance with no discount. Because log₂(2) = 1, position 2 is also effectively undiscounted; the discount only starts to reduce scores at position 3.
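
A minimal Python sketch of the Variant 1 formula as written above may make the discounting concrete. The function name `dcg_jarvelin` and the example grades are hypothetical, chosen only for illustration.

```python
import math

def dcg_jarvelin(grades, k):
    """DCG@k with linear gain: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    total = 0.0
    for i, rel in enumerate(grades[:k], start=1):
        # Position 1 is taken as-is; log2(2) == 1, so position 2 is also
        # left at full value and the divisor only exceeds 1 from position 3.
        total += rel if i == 1 else rel / math.log2(i)
    return total

# Hypothetical grades, listed in the order the system ranked the documents.
example_grades = [3, 2, 3, 0, 1]
example_dcg = dcg_jarvelin(example_grades, k=5)
print(round(example_dcg, 4))  # about 7.3235
```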

Variant 2: Burges et al. (2005) — “Exponential Gain” (default in most ML frameworks)

DCG@k = Σᵢ₌₁ᵏ (2^rel_i - 1) / log₂(i + 1)

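Every position is divided by log₂(i + 1). That divisor equals 1 at position 1, so the top result keeps its full gain and the discount starts at position 2. The 2^rel - 1 transform amplifies high-relevance documents.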

For grade 3: 2³-1 = 7. For grade 1: 2¹-1 = 1. A grade-3 document therefore contributes 7x the gain of a grade-1 document, compared with 3x under Variant 1's linear gain.
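
The exponential-gain formula can be sketched the same way. This is an illustration of the Variant 2 formula above, not any particular library's implementation; `dcg_burges` and the example grades are hypothetical names and values.

```python
import math

def dcg_burges(grades, k):
    """DCG@k with exponential gain: sum of (2^rel_i - 1) / log2(i + 1)."""
    total = 0.0
    for i, rel in enumerate(grades[:k], start=1):
        gain = 2 ** rel - 1          # grade 3 -> 7, grade 1 -> 1, grade 0 -> 0
        total += gain / math.log2(i + 1)
    return total

# Hypothetical grades, listed in the order the system ranked the documents.
example_grades = [3, 2, 3, 0, 1]
example_dcg = dcg_burges(example_grades, k=5)
print(round(example_dcg, 4))  # about 12.7796
```

The raw DCG value is larger than a linear-gain DCG for the same grades, which is one reason DCG values from different formulas should not be compared directly.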

Why the Difference Matters

Same judgment list, same ranking, two variants:

Variant 1 NDCG@10: 0.72
Variant 2 NDCG@10: 0.68

A 4-point “improvement” measured between systems scored with different variants is meaningless: the gap can come entirely from the formula rather than from the rankings.

Rule: always specify which DCG formula you’re using when reporting NDCG.
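
To see the effect on a single ranking, the sketch below scores one hypothetical list of grades under both formulas. The grades, the `VARIANTS` table, and the choice to build the ideal ranking from the returned grades only are all assumptions made for this illustration, so the numbers will not match the 0.72 and 0.68 figures above.

```python
import math

def dcg(grades, k, gain, discount):
    """Generic DCG@k: the gain of each grade divided by a position discount."""
    return sum(gain(rel) / discount(i)
               for i, rel in enumerate(grades[:k], start=1))

def ndcg(grades, k, gain, discount):
    """NDCG@k, normalized by the DCG of the same grades sorted best-first."""
    ideal = sorted(grades, reverse=True)
    idcg = dcg(ideal, k, gain, discount)
    return dcg(grades, k, gain, discount) / idcg if idcg > 0 else 0.0

# (gain function, discount function) for each variant described above.
VARIANTS = {
    "jarvelin": (lambda rel: rel,
                 lambda i: 1.0 if i == 1 else math.log2(i)),
    "burges": (lambda rel: 2 ** rel - 1,
               lambda i: math.log2(i + 1)),
}

ranked_grades = [3, 2, 3, 0, 1]  # hypothetical grades in ranked order
scores = {name: ndcg(ranked_grades, 5, gain, discount)
          for name, (gain, discount) in VARIANTS.items()}
for name, score in scores.items():
    print(f"{name}: NDCG@5 = {score:.4f}")
```

With these grades the two variants give roughly 0.944 and 0.957 for the same ranking, a gap that says nothing about ranking quality.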

Normalization: “To What”?

NDCG divides DCG by IDCG (ideal DCG): NDCG@k = DCG@k / IDCG@k. The “ideal” is the DCG of the best possible ranking given your judgment list, computed with the same DCG formula.

Key implication: IDCG is bounded by your judgment pool. If you only judged documents returned by System A, System B may retrieve better documents that appear “unjudged” (treated as grade 0). This biases NDCG against systems that retrieve outside the judged pool.
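
The pooling bias can be illustrated with a small sketch. Everything here is hypothetical: the document IDs, the grades, and the helper `ndcg_at_k`, which uses the exponential-gain formula and builds the ideal ranking from the whole judgment list.

```python
import math

def dcg(grades):
    """Exponential-gain DCG over an already truncated list of grades."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(grades, start=1))

def ndcg_at_k(ranking, judgments, k):
    """NDCG@k where unjudged documents are treated as grade 0."""
    grades = [judgments.get(doc, 0) for doc in ranking[:k]]
    # The ideal ranking can only contain documents that were judged.
    ideal = sorted(judgments.values(), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(grades) / idcg if idcg > 0 else 0.0

# Hypothetical judgments, collected only from System A's results.
judgments = {"a1": 3, "a2": 2, "a3": 0}

system_a = ["a1", "a2", "a3"]
system_b = ["b1", "a1", "b2"]  # b1 and b2 were never judged

score_a = ndcg_at_k(system_a, judgments, k=3)
score_b = ndcg_at_k(system_b, judgments, k=3)
print(score_a)            # 1.0
print(round(score_b, 4))  # about 0.4966
```

System B scores about half of System A here even if `b1` would have been graded highly by a human judge, because the metric only sees the judgments that exist.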

Common Implementations

Library	Variant Used
scikit-learn `ndcg_score`	Linear gain (rel_i) with a log₂(i + 1) discount; not exponential
RankLib	Burges (exponential gain, 2^rel - 1)
Quepid	Configurable
LightGBM LambdaRank	Burges
MS MARCO leaderboard	Burges

Practical Advice

Use the same variant throughout your project
Be explicit when comparing to published benchmarks
For e-commerce with 0–4 grades: the Burges variant is often the better fit, because it amplifies highly relevant items
For binary relevance (0 or 1): the gain transforms coincide (2⁰-1 = 0, 2¹-1 = 1), so the variants differ only in their position discount

Related Articles

Session vs Query based Search Evals — same author, complementary
Compute MRR using Pandas — same author, simpler metric
Evaluating Search - Using Human Judgments — how judgments feed into NDCG

Related Concepts

NDCG — primary topic
Search Evaluation — broader context
Judgment Lists — input to NDCG computation
MRR — simpler, less ambiguous alternative
