Choosing Your Search Relevance Evaluation Metric

Source: https://opensourceconnections.com/blog/2020/02/28/choosing-your-search-relevance-metric/
Publisher: OpenSource Connections

Summary

A decision guide for choosing between NDCG, MRR, Precision and Recall, and behavioral metrics for search quality evaluation — emphasizing that the right metric depends on use case, not convention.

The Decision Framework

Step 1: What Kind of Relevance Do You Have?

Graded relevance (0–4 scale, different quality levels):

  • Use NDCG — captures grade differences
  • Examples: e-commerce (perfect match vs. acceptable), enterprise search

Binary relevance (relevant or not):

  • Use MRR or Precision@k — simpler, appropriate for binary
  • Examples: QA systems, navigational search
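
To make the graded-versus-binary distinction concrete, here is a minimal sketch, assuming the common exponential-gain form of DCG and an ideal ordering built from the same judged list. The function names and the sample grades are illustrative, not taken from the source.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades (exponential gain)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """DCG divided by the DCG of the same grades sorted best-first.

    Assumption: the ideal ordering is built from the judged results we have,
    not from every relevant document in the collection.
    """
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

def precision_at_k(grades, k, threshold=1):
    """Fraction of the top k whose grade meets a binary relevance threshold."""
    return sum(1 for g in grades[:k] if g >= threshold) / k

# Two rankings of the same four results, graded on a 0-4 scale.
ranking_a = [4, 1, 0, 1]  # perfect match shown first
ranking_b = [1, 1, 0, 4]  # perfect match buried at position 4

print(precision_at_k(ranking_a, 4), precision_at_k(ranking_b, 4))
print(ndcg_at_k(ranking_a, 4), ndcg_at_k(ranking_b, 4))
```

Both rankings have the same Precision@4 because the binary view treats a grade of 1 and a grade of 4 alike, while NDCG@4 is higher for the ranking that puts the grade-4 result first.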

Step 2: How Many Relevant Documents Exist?

One correct answer (navigational, QA):

  • Use MRR — focused on finding that one result
  • Example: “what is the company’s refund policy?” → single document is correct

Multiple acceptable answers (product search, informational):

  • Use NDCG@k or Precision@k — rewards finding multiple relevant docs
  • Example: “running shoes” → many relevant products
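
For the single-correct-answer case, MRR can be sketched as below. This is a minimal illustration, assuming each query's results are already reduced to a list of booleans marking whether the result at that rank is the correct one; the sample data is made up.

```python
def reciprocal_rank(relevant_flags):
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over all queries."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(flags) for flags in queries) / len(queries)

# Hypothetical judgments for three queries.
queries = [
    [False, True, False],   # correct answer at rank 2 -> 0.5
    [True, False, False],   # correct answer at rank 1 -> 1.0
    [False, False, False],  # correct answer not retrieved -> 0.0
]
print(mean_reciprocal_rank(queries))  # 0.5
```

Only the first relevant result contributes, which is why MRR fits queries with one right answer and says little about queries with many.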

Step 3: Does Rank Order Matter?

Yes — higher rank = more important:

  • Use NDCG or MRR — position-sensitive

No (only which results appear above the cutoff matters):

  • Use Recall@k or Precision@k — position-insensitive
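
The sketch below illustrates what position-insensitive means in practice: reordering results inside the top k leaves Precision@k and Recall@k unchanged, while the rank of the first relevant result (which drives MRR) moves. The document IDs and helper names are hypothetical.

```python
def set_precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def first_relevant_rank(ranked_ids, relevant_ids):
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return rank
    return None

relevant = {"d1", "d2", "d3"}      # d3 is never retrieved
order_a = ["d1", "d2", "x", "y"]   # relevant results first
order_b = ["x", "y", "d1", "d2"]   # same results, relevant ones last

for order in (order_a, order_b):
    print(
        set_precision_at_k(order, relevant, 4),
        recall_at_k(order, relevant, 4),
        first_relevant_rank(order, relevant),
    )
```

If the two orderings would feel different to your users, a position-sensitive metric is the safer primary choice.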

Step 4: Do You Have Judgment Labels?

Yes, with grades: NDCG
Yes, binary: MRR or Precision@k
No (behavioral data only): CTR, conversion rate, session success (Click Signals)
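
When only behavioral data is available, the simplest signals can be computed directly from a search log. The sketch below assumes a hypothetical log schema (`SearchEvent` with `clicked` and `converted` flags) and measures both rates per search; session success is left out because its definition depends on the application and the source does not specify one.

```python
from dataclasses import dataclass

@dataclass
class SearchEvent:
    """One logged search (hypothetical schema, not from the source)."""
    query: str
    clicked: bool    # user clicked at least one result
    converted: bool  # search led to a purchase, signup, or similar goal

def click_through_rate(events):
    """Share of searches with at least one click."""
    return sum(e.clicked for e in events) / len(events) if events else 0.0

def conversion_rate(events):
    """Share of searches that led to a conversion."""
    return sum(e.converted for e in events) / len(events) if events else 0.0

# Made-up log entries for illustration.
log = [
    SearchEvent("running shoes", clicked=True, converted=True),
    SearchEvent("running shoes", clicked=True, converted=False),
    SearchEvent("refund policy", clicked=False, converted=False),
    SearchEvent("trail shoes", clicked=False, converted=False),
]
print(click_through_rate(log), conversion_rate(log))
```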

Metric Selection Summary

Search Type                  Recommended Primary   Recommended Secondary
E-commerce product search    NDCG@10               Conversion rate
QA / knowledge base          MRR@10                P@1
Enterprise document search   NDCG@5                Session success
Navigational                 P@1                   MRR
Exploratory/discovery        NDCG + diversity      Session depth
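
One way to keep these recommendations next to the evaluation code is a small lookup table, sketched below. The dictionary keys and the `recommended_metrics` helper are hypothetical names; the values mirror the summary above.

```python
# Search type -> (primary metric, secondary metric), copied from the summary.
METRIC_RECOMMENDATIONS = {
    "e-commerce product search": ("NDCG@10", "Conversion rate"),
    "qa / knowledge base": ("MRR@10", "P@1"),
    "enterprise document search": ("NDCG@5", "Session success"),
    "navigational": ("P@1", "MRR"),
    "exploratory/discovery": ("NDCG + diversity", "Session depth"),
}

def recommended_metrics(search_type):
    """Return (primary, secondary) for a search type, case-insensitively."""
    key = search_type.strip().lower()
    if key not in METRIC_RECOMMENDATIONS:
        raise KeyError(f"No recommendation recorded for search type: {search_type!r}")
    return METRIC_RECOMMENDATIONS[key]

primary, secondary = recommended_metrics("Navigational")
print(primary, secondary)
```

Treat the lookup as a starting point; the steps above still decide whether a recommendation fits your judgments and traffic.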

Common Mistakes in Metric Selection

  1. Copying academic benchmarks blindly: MS MARCO passage ranking reports MRR@10 and the TREC Deep Learning track reports NDCG@10; both are appropriate for passage retrieval but may not suit your use case
  2. Metric without business alignment: high NDCG doesn’t always mean business success
  3. Single metric tyranny: no single metric captures everything; always track 2–3