Map of Content: Search Quality Assurance & Query Understanding

Entry point for the knowledge graph covering the Search Quality Assurance and Query Understanding sections of the Awesome Search collection.


Evaluation Metrics

Ranking Quality

| Metric | Type | Best For |
|---|---|---|
| NDCG | Offline, graded | Overall ranking quality |
| MAP | Offline, binary | Multi-document recall; IR benchmarks |
| MRR | Offline, binary | QA / known-item search |
| Hit Rate@K | Offline, binary | RAG retrieval coverage, recommendation systems |
| Precision and Recall | Offline, foundational | Coverage + accuracy; P@K |
| UDCG | Offline, agentic | RAG/LLM pipelines; penalizes distractors |
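
As a rough illustration of how the graded and binary metrics in the table differ, the following is a minimal sketch using the common textbook definitions. The function names are illustrative, not taken from any linked article, and `ndcg_at_k` assumes the ideal ordering can be derived from the retrieved list alone (a full evaluation would use the complete judgment pool). UDCG is left out because this page does not define it.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded gains (linear-gain form)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the same gains sorted best-first.

    Assumption: the ideal ranking is built only from the gains passed in.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for i, r in enumerate(rels):
        if r:
            return 1.0 / (i + 1)
    return 0.0

def hit_at_k(rels, k):
    """1.0 if any of the top k results is relevant, else 0.0."""
    return 1.0 if any(rels[:k]) else 0.0

def precision_at_k(rels, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for r in rels[:k] if r) / k

def average_precision(rels, total_relevant=None):
    """Mean of precision@i over the ranks i that hold a relevant result.

    If total_relevant is omitted, only relevant results present in the
    list are counted in the denominator.
    """
    hits, score = 0, 0.0
    for i, r in enumerate(rels):
        if r:
            hits += 1
            score += hits / (i + 1)
    denom = total_relevant if total_relevant is not None else hits
    return score / denom if denom else 0.0

# Binary judgments for one query, in ranked order (1 = relevant).
rels = [0, 1, 0, 1]
rr = reciprocal_rank(rels)    # first relevant result is at rank 2
ap = average_precision(rels)
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over a query set; Hit Rate@K is the mean of `hit_at_k`.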

Diversity Metrics

| Metric | What it captures |
|---|---|
| MMR (Maximal Marginal Relevance) | Penalizes redundant results; active reranking for diversity |
| APD (Average Pairwise Distance) | Average dissimilarity across result pairs; passive measurement |
| Diversity Metrics | Overview of all diversity approaches |
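
The active/passive distinction in the table can be sketched as follows. This is a minimal illustration that assumes documents and the query are plain embedding vectors and that cosine similarity is the similarity measure; the function names and the `lam` trade-off parameter are illustrative choices, not taken from a specific library.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, doc_vecs, k, lam=0.5):
    """Greedy MMR reranking; returns the indices of the selected documents.

    Each step picks the candidate maximizing
        lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

def average_pairwise_distance(vecs):
    """Mean cosine distance (1 - similarity) over all unordered result pairs."""
    n = len(vecs)
    if n < 2:
        return 0.0
    total = sum(
        1.0 - cosine(vecs[i], vecs[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)

# Two near-duplicate documents and one dissimilar document.
docs = [[0.9, 0.1], [0.9, 0.11], [0.1, 0.9]]
diverse = mmr([1.0, 0.0], docs, k=2, lam=0.3)   # low lam favors diversity
```

MMR changes the ranking, while APD only scores a ranking that already exists, so APD can be used to check whether an MMR pass actually increased diversity.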

Behavioral / Product Metrics

| Metric | What it captures |
|---|---|
| Click Signals | Real-user engagement; CTR, dwell time |
| Clicks Residual | Gap between expected and observed clicks; a query success signal |
| Zero Results | Direct coverage failure |
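
A rough sketch of how two of these signals might be computed from a search log. The log schema (`query`, `num_results`, `clicks`), the single flat `expected_ctr` baseline, and the sign convention (observed minus expected) are all assumptions made for illustration; a production system would more likely derive expected clicks from a position- or query-class-aware model.

```python
from collections import defaultdict

def zero_result_rate(log):
    """Share of logged searches that returned no results."""
    if not log:
        return 0.0
    return sum(1 for e in log if e["num_results"] == 0) / len(log)

def clicks_residual(log, expected_ctr):
    """Per-query observed clicks minus expected clicks.

    Expected clicks are modeled here as searches * expected_ctr
    (a hypothetical flat baseline). Negative values flag queries that
    attract fewer clicks than the baseline predicts.
    """
    searches = defaultdict(int)
    clicks = defaultdict(int)
    for e in log:
        searches[e["query"]] += 1
        clicks[e["query"]] += e["clicks"]
    return {q: clicks[q] - searches[q] * expected_ctr for q in searches}

# Hypothetical log entries, one per search.
log = [
    {"query": "shoes", "num_results": 10, "clicks": 1},
    {"query": "shoes", "num_results": 10, "clicks": 0},
    {"query": "xyzzy", "num_results": 0, "clicks": 0},
    {"query": "xyzzy", "num_results": 0, "clicks": 0},
]
residuals = clicks_residual(log, expected_ctr=0.25)
```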

Evaluation Infrastructure


Query Sampling

How to build a representative query set for evaluation:

  • Query Sampling — Random, Stratified, and PPS sampling explained and compared
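
The three sampling strategies named above can be sketched with the standard library. This is a minimal illustration that assumes PPS stands for probability-proportional-to-size sampling with query frequency as the size measure and with replacement; the function names and the head/tail split are hypothetical.

```python
import random

def random_sample(queries, n, seed=0):
    """Uniform sample of n distinct queries, ignoring frequency."""
    rng = random.Random(seed)
    return rng.sample(queries, n)

def stratified_sample(queries, stratum_of, per_stratum, seed=0):
    """Sample up to per_stratum queries from each stratum.

    stratum_of maps a query to its stratum label (e.g. "head" or "tail").
    """
    rng = random.Random(seed)
    strata = {}
    for q in queries:
        strata.setdefault(stratum_of(q), []).append(q)
    return {
        label: rng.sample(members, min(per_stratum, len(members)))
        for label, members in strata.items()
    }

def pps_sample(query_freq, n, seed=0):
    """Draw n queries with replacement, each with probability
    proportional to its frequency."""
    rng = random.Random(seed)
    queries = list(query_freq)
    weights = [query_freq[q] for q in queries]
    return rng.choices(queries, weights=weights, k=n)

# Hypothetical query frequencies.
freq = {"shoes": 1000, "red shoes": 100, "sku 12345": 1}
by_stratum = stratified_sample(
    list(freq),
    stratum_of=lambda q: "head" if freq[q] >= 100 else "tail",
    per_stratum=1,
)
traffic_weighted = pps_sample(freq, n=5)
```

Uniform sampling over distinct queries over-represents the tail relative to traffic, PPS mirrors what users actually search for, and stratification guarantees that each segment appears in the evaluation set.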

Query Understanding Components


  • Query Relaxation — term dropping, AND→OR, facet/attribute softening, semantic hierarchy traversal

  • Query Specificity — spectrum from “shoes” (browse) to exact SKU; impacts retrieval strategy, ranking, and UX
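
Two of the relaxation techniques listed above, term dropping and AND→OR, can be sketched as a ladder of progressively looser queries. This is an illustrative toy over whitespace-tokenized strings, not any engine's API; the leave-one-out dropping order is an assumption, and real systems typically choose which term to drop by term importance.

```python
def relaxation_steps(terms):
    """Return (operator, terms) pairs from strictest to loosest.

    Order: all terms ANDed, then each leave-one-out AND, then all terms ORed.
    """
    terms = list(terms)
    steps = [("AND", terms)]
    if len(terms) > 1:
        for i in range(len(terms)):
            steps.append(("AND", terms[:i] + terms[i + 1:]))
        steps.append(("OR", terms))
    return steps

def search_with_relaxation(terms, docs, min_results=1):
    """Try each relaxation step until at least min_results documents match."""
    for op, step_terms in relaxation_steps(terms):
        match = all if op == "AND" else any
        hits = [
            d for d in docs
            if match(t in d.lower().split() for t in step_terms)
        ]
        if len(hits) >= min_results:
            return op, step_terms, hits
    return None, [], []

# Hypothetical catalog: the strict query has no match, so "leather" is dropped.
docs = ["red running shoes", "blue running shoes", "red dress"]
op, used_terms, hits = search_with_relaxation(["red", "leather", "shoes"], docs)
```

Returning the operator and the surviving terms alongside the hits lets the UI tell the user which part of the query was relaxed.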

Key People

| Person | Key Contributions |
|---|---|
| Daniel Tunkelang | Query Understanding series, Evaluating Search series, “Intent Not Inventory” |
| Doug Turnbull | Flavors of NDCG, Judgment Lists, Quepid, Session eval |
| James Rubinstein | Measuring Search series (metrics + human approach) |
| Andreas Wagner | Three Pillars: Findability, Relevance, Discovery |
| Jo Kristian Bergum | LLM-as-a-judge for retrieval (Vespa) |
| Aparna Dhinakaran | LLM-as-judge methodology; explanation-first pattern |
| Lester Solbakken | UDCG advocacy; dynamic-k and abstention design |
| Tao Ruangyam | LLM-as-judge production pipeline at Zalando; NER-based test generation |

Articles by Topic

Search Evaluation Philosophy

Metrics Deep Dives

Judgment & Annotation

Query Understanding


Cross-Section Connections

Query Understanding ←→ Agentic Search
(agent must understand multi-turn intent)

Judgment Lists ←→ LLM as Judge
(LLM automates what humans do manually)

Session-Based Evaluation ←→ Click Signals
(session behavior is the ground truth)

NDCG ←→ Retrieval Pipeline quality
(NDCG@10 is a standard metric for retrieval benchmarks such as BEIR)

UDCG ←→ Clean Context
(UDCG measures it; Clean Context achieves it)

Relation to Embeddings & Retrieval

The evaluation metrics in this section are used to measure the techniques covered in the Embeddings section:

  • Matryoshka Embeddings — measured by NDCG at various dimensions
  • SPLADE / ELSER — “17% NDCG@10 improvement over BM25”
  • ColBERT — BEIR benchmark NDCG@10
  • RAG — evaluated by faithfulness + relevancy (related to NDCG + recall)
  • SIRA — wins Recall@10 and NDCG@10 on BEIR with LLM-enriched BM25 queries

See MOC - Agentic Search and Embeddings for the retrieval section.

Additional Articles — Evaluation & Metrics

Additional Articles — Query Understanding & Facets