Measuring Search: Metrics Matter

Source: https://jamesrubinstein.medium.com/measuring-search-metrics-matter-de124c2f6f8c
Author: James Rubinstein

Summary

Part of James Rubinstein’s “Measuring Search” series. Explains the major search evaluation metrics, when to use each, and the traps practitioners fall into when relying on a single metric.

The Metrics Landscape

Offline Ranking Metrics

NDCG — The standard. Use when:

You have graded relevance (0–4 scale)
You care about position (higher = more important)
You’re comparing ranking algorithms

MRR — Simpler. Use when:

There’s exactly one correct answer
Question-answering systems
Binary relevance is sufficient

Precision and Recall@k — Fundamental. Use when:

Understanding precision/recall tradeoff
Quick sanity checks

Online Behavioral Metrics

CTR (Click-Through Rate):

Easy to collect, always available
Biased by position, presentation, novelty
Can game: provocative titles get clicks without satisfying users

Dwell time:

Better satisfaction proxy than click alone
“Bounce rate” (return < 30s) = pogo-sticking = dissatisfied

Session abandonment:

User left without successful outcome
Powerful signal if you can define “successful”

Zero-result rate:

Queries with no results = certain failure
Easy to measure, high signal for recall gaps

Metric Traps

The Single Metric Trap

Optimizing a single metric eventually breaks something else:

Optimize CTR → clickbait titles → user dissatisfaction
Optimize NDCG → overfit to judgment set → production regression
Optimize zero-result rate → return everything → precision collapses

The Proxy Trap

Confusing the proxy with the goal:

CTR is a proxy for satisfaction
NDCG is a proxy for CTR is a proxy for satisfaction
Optimize the proxy ≠ optimize the goal

The Freshness Trap

Judgment lists go stale. A metric of 0.83 on a 2-year-old judgment set may not reflect current system quality.

Rubinstein’s Measurement Hierarchy

User outcome (task success, satisfaction survey)
Behavioral proxy (CTR, dwell, conversion)
Offline rating (NDCG, MRR on fresh judgments)

Each level is faster/cheaper than the one above, and less reliable. Use all three; don’t replace upper levels with lower.

Measuring Search Effectiveness — Tunkelang’s parallel framework
Measuring Search - A Human Approach — same author, human judgment focus
Flavors of NDCG — Doug Turnbull, NDCG implementation details

Search Evaluation — broader context
NDCG — primary offline metric
MRR — simpler offline metric
Precision and Recall — foundational
Click Signals — behavioral proxies
Judgment Lists — required for offline metrics

People

James Rubinstein — author
Daniel Tunkelang — parallel measurement work

Awesome Search KG

Explorer

Measuring Search: Metrics Matter

Measuring Search: Metrics Matter

Summary

The Metrics Landscape

Offline Ranking Metrics

Online Behavioral Metrics

Metric Traps

The Single Metric Trap

The Proxy Trap

The Freshness Trap

Rubinstein’s Measurement Hierarchy

People

Graph View

Table of Contents

Backlinks

Awesome Search KG

Explorer

Measuring Search: Metrics Matter

Measuring Search: Metrics Matter

Summary

The Metrics Landscape

Offline Ranking Metrics

Online Behavioral Metrics

Metric Traps

The Single Metric Trap

The Proxy Trap

The Freshness Trap

Rubinstein’s Measurement Hierarchy

Related Articles

Related Concepts

People

Graph View

Table of Contents

Backlinks