Search Quality Assurance

The practice of measuring, monitoring, and improving search quality systematically. SQA answers: “Is search getting better or worse, and how do we know?”

Evaluation Paradigms

Session-Based Evaluation

Measures outcomes at the session level — did the user accomplish their task?

Signals: task completion, session abandonment, zero reformulations
Strength: captures multi-query behavior; closer to business outcomes
Weakness: harder to attribute to a specific ranking change; requires more data

See Session-Based Evaluation.

Query-Based Evaluation

Measures quality per individual query, averaged over a query set.

Signals: NDCG, MRR, MAP, P@K on a labeled sample
Strength: fast, targeted, actionable; easy to diff between systems
Weakness: ignores session context; quality of eval depends entirely on the query sample

Most offline evaluation is query-based. The query sample construction is critical — see Query Sampling.
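
A minimal sketch of what "averaged over a query set" and "easy to diff between systems" can look like in practice. The judgment and result dictionaries are made-up illustrations, and reciprocal rank stands in for whichever per-query metric you actually use.

```python
from statistics import mean

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant document, or 0.0 if none is returned."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def evaluate(system_results, judgments):
    """Score every judged query; queries the system did not answer score 0.0."""
    return {
        query: reciprocal_rank(system_results.get(query, []), relevant)
        for query, relevant in judgments.items()
    }

def diff_systems(baseline, candidate, judgments):
    """Return both means plus the per-query delta (candidate minus baseline)."""
    base = evaluate(baseline, judgments)
    cand = evaluate(candidate, judgments)
    delta = {query: cand[query] - base[query] for query in judgments}
    return mean(base.values()), mean(cand.values()), delta

# Hypothetical data: query -> set of relevant doc ids, and query -> ranked doc ids.
judgments = {"red shoes": {"d1"}, "usb c cable": {"d7", "d9"}}
baseline = {"red shoes": ["d3", "d1"], "usb c cable": ["d7", "d2"]}
candidate = {"red shoes": ["d1", "d3"], "usb c cable": ["d2", "d9"]}

base_mean, cand_mean, delta = diff_systems(baseline, candidate, judgments)
```

In this toy example both systems have the same mean, yet one query improves and the other regresses. Looking at the per-query delta rather than only the average is what makes a query-based diff actionable.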

Building the Query Sample

The quality of your evaluation is only as good as your query sample. A biased sample gives misleading metrics.

Method	When to use
Random sampling	Baseline; simple, but sampling uniformly over distinct queries over-represents rare queries relative to their traffic share
Stratified sampling	Sample separately by query type, frequency tier, or domain — ensures head/torso/tail are all represented
PPS (Probability-Proportional-to-Size)	Weight sample by traffic volume — head queries get more labels, but tail queries still appear proportionally

See Query Sampling for full details.
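
The three methods in the table can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the query counts are invented, the head/torso/tail thresholds in `frequency_tier` are hypothetical, and the PPS variant shown draws with replacement for simplicity.

```python
import random
from collections import defaultdict

def frequency_tier(count):
    """Hypothetical thresholds; pick cut-offs that match your own traffic."""
    if count >= 1000:
        return "head"
    if count >= 50:
        return "torso"
    return "tail"

def uniform_sample(query_counts, n, rng):
    """Every distinct query is equally likely, regardless of traffic."""
    return rng.sample(sorted(query_counts), min(n, len(query_counts)))

def pps_sample(query_counts, n, rng):
    """Draw with replacement, with probability proportional to traffic."""
    queries = sorted(query_counts)
    weights = [query_counts[q] for q in queries]
    return rng.choices(queries, weights=weights, k=n)

def stratified_sample(query_counts, n_per_tier, rng):
    """Sample each frequency tier separately so every tier is represented."""
    tiers = defaultdict(list)
    for query in sorted(query_counts):
        tiers[frequency_tier(query_counts[query])].append(query)
    return {
        tier: rng.sample(members, min(n_per_tier, len(members)))
        for tier, members in tiers.items()
    }

# Invented query log summary: query -> number of searches.
query_counts = {
    "iphone": 9000, "laptop": 4000, "usb hub": 300, "desk lamp": 120,
    "left handed scissors": 4, "m3 hex bolt 40mm": 2, "teal yoga mat 8mm": 1,
}
rng = random.Random(7)  # fixed seed so the sample is reproducible
uniform = uniform_sample(query_counts, 3, rng)
pps = pps_sample(query_counts, 1000, rng)
stratified = stratified_sample(query_counts, 2, rng)
```

Because a PPS draw with replacement repeats head queries many times, a labeling pipeline would typically deduplicate the sample and keep the draw counts as weights.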

Metrics

Ranking Quality

Measure whether the right results are ranked at the top.

Metric	What it captures	Best for
NDCG	Graded relevance, discounted by position	Multi-grade labels, long result lists
MRR	Position of first relevant result	Navigational queries, single-answer tasks
MAP	Precision at each relevant result, averaged	Binary relevance, recall-sensitive tasks
Precision and Recall	Coverage and accuracy tradeoff	Filtering tasks, document retrieval
P@K	Precision in top-K results	Quick sanity check on top of SERP
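
For reference, a compact sketch of the metrics in the table, computed for a single query from its graded labels in ranked order. Two choices here are assumptions rather than the only valid definitions: DCG uses the exponential gain `2**grade - 1`, and average precision divides by the relevant results found in the list unless `total_relevant` is supplied.

```python
import math

def precision_at_k(grades, k):
    """Share of the top-k results with a grade above zero."""
    return sum(1 for g in grades[:k] if g > 0) / k

def reciprocal_rank(grades):
    """1 / position of the first relevant result; MRR is its mean over queries."""
    for position, g in enumerate(grades, start=1):
        if g > 0:
            return 1.0 / position
    return 0.0

def average_precision(grades, total_relevant=None):
    """Mean of precision at each relevant position; MAP is its mean over queries."""
    hits, score = 0, 0.0
    for position, g in enumerate(grades, start=1):
        if g > 0:
            hits += 1
            score += hits / position
    denominator = total_relevant if total_relevant is not None else hits
    return score / denominator if denominator else 0.0

def dcg(grades, k):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(position + 1)
               for position, g in enumerate(grades[:k], start=1))

def ndcg(grades, k, all_judged_grades=None):
    """DCG divided by the DCG of the ideal ordering of the judged grades."""
    pool = all_judged_grades if all_judged_grades is not None else grades
    ideal = dcg(sorted(pool, reverse=True), k)
    return dcg(grades, k) / ideal if ideal else 0.0

# Grades (0 = irrelevant .. 3 = perfect) of one result list, top to bottom.
grades = [3, 0, 2, 0, 1]
p_at_3 = precision_at_k(grades, 3)
rr = reciprocal_rank(grades)
ap = average_precision(grades)
ndcg_at_5 = ndcg(grades, 5)
```

Passing `all_judged_grades` matters when relevant documents exist that the system failed to return; without it, NDCG only measures how well the returned results are ordered.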

Diversity Metrics

Measure whether results cover different aspects of the query intent.

Metric	What it captures
MMR (Maximal Marginal Relevance)	Penalizes redundant results; balances relevance and novelty
APD (Average Pairwise Distance)	Average dissimilarity across all result pairs
Entropy	Distribution of result categories/topics
Diversity Metrics	Overview of diversity measurement approaches
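
Two of the measures above are simple enough to sketch directly. The example assumes each result has an embedding vector (for APD) and a category label (for entropy); cosine distance is one possible dissimilarity, chosen here only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity; treated as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def average_pairwise_distance(vectors):
    """Mean dissimilarity over all unordered pairs of results."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

def category_entropy(categories):
    """Shannon entropy (in bits) of the category distribution in a result list."""
    total = len(categories)
    counts = Counter(categories)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical top-4 result lists described by category.
uniform_list = ["shoes", "shoes", "shoes", "shoes"]
mixed_list = ["shoes", "socks", "laces", "insoles"]
entropy_uniform = category_entropy(uniform_list)
entropy_mixed = category_entropy(mixed_list)
apd_orthogonal = average_pairwise_distance([[1.0, 0.0], [0.0, 1.0]])
apd_identical = average_pairwise_distance([[1.0, 0.0], [1.0, 0.0]])
```

A list drawn from a single category has zero entropy, while four results spread evenly over four categories reach the maximum of 2 bits.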

Behavioral / Product Metrics

Derived from real user interactions — no labels needed, but noisier.

Metric	What it captures
CTR	Click-through rate; engagement with results
Zero clicks	Queries where no result was clicked — possible failure or direct answer
Clicks Residual	Gap between expected and observed click distribution — success signal at query level
Zero Results	Queries returning no results — direct coverage failure
Query reformulation rate	Users who rephrase after seeing results — dissatisfaction signal
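
A rough sketch of how several of these rates could be derived from a search log. The event schema (`session_id`, `query`, `n_results`, `clicked_positions`) is hypothetical, and the definitions are one reasonable choice among several: CTR is taken as the share of searches with at least one click, and a reformulation is any search followed by a different query in the same session.

```python
from collections import defaultdict

def behavioral_metrics(events):
    """events: search events in chronological order, one dict per search."""
    total = len(events)
    if total == 0:
        return {}
    with_results = [e for e in events if e["n_results"] > 0]
    clicked = sum(1 for e in events if e["clicked_positions"])
    zero_clicks = sum(1 for e in with_results if not e["clicked_positions"])

    # Group queries by session to detect rephrasing within a session.
    by_session = defaultdict(list)
    for e in events:
        by_session[e["session_id"]].append(e["query"])
    reformulated = sum(
        1
        for queries in by_session.values()
        for previous, following in zip(queries, queries[1:])
        if previous != following
    )

    return {
        "ctr": clicked / total,
        "zero_results_rate": (total - len(with_results)) / total,
        # Zero-click rate is computed only over searches that returned results.
        "zero_click_rate": zero_clicks / len(with_results) if with_results else 0.0,
        "reformulation_rate": reformulated / total,
    }

# Invented log sample.
events = [
    {"session_id": "s1", "query": "nike shoe", "n_results": 20, "clicked_positions": []},
    {"session_id": "s1", "query": "nike shoes", "n_results": 40, "clicked_positions": [1]},
    {"session_id": "s2", "query": "asdfgh", "n_results": 0, "clicked_positions": []},
    {"session_id": "s3", "query": "usb c cable", "n_results": 12, "clicked_positions": [2, 3]},
]
metrics = behavioral_metrics(events)
```

Separating zero-results searches from zero-click searches keeps a coverage failure from being counted twice.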

Evaluation Modes

Offline Evaluation

Evaluate a system against a labeled dataset without user traffic.

Advantages: fast, cheap, reproducible, safe (no user exposure to bad systems)
Limitations: labels may not reflect current user intent; doesn’t capture behavioral effects

Requires Judgment Lists. Three ways to produce labels:

Human judgments — annotators rate query-document pairs; gold standard but slow and expensive
Implicit judgments — derive labels from click logs, dwell time, purchases; fast but biased by position and existing ranking
LLM as judge — use an LLM to produce relevance ratings; scalable but requires calibration against human labels (see LLM as Judge)
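
One common way to check an LLM judge against human labels is chance-corrected agreement. The sketch below computes unweighted Cohen's kappa on a small, invented set of doubly labeled query-document pairs; for ordinal grades a weighted variant may be more appropriate, and what counts as acceptable agreement is a team decision, not something this code determines.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each rater's label distribution.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented grades (0 = irrelevant, 1 = partial, 2 = relevant) for 8 pairs.
human_labels = [2, 2, 1, 0, 0, 1, 2, 0]
llm_labels = [2, 1, 1, 0, 0, 1, 2, 1]
kappa = cohens_kappa(human_labels, llm_labels)
```

Here the raters agree on 6 of 8 pairs (0.75 raw agreement), but kappa is lower because some of that agreement would occur by chance.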

Online Evaluation

Measure the impact of a change on real users via controlled experiments.

See A-B Testing for Search for the full treatment: interleaving, A/B splits, metric selection, statistical considerations.
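
As a small illustration of the statistical side, the sketch below runs a two-proportion z-test on a click-based success rate for control and treatment. The counts are invented, and the test assumes independent observations, which rarely holds exactly for search traffic because one user issues many queries; treat it as a starting point rather than a complete analysis.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 0.0, 1.0  # no variation at all, so no detectable difference
    z = (rate_b - rate_a) / standard_error
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented experiment: searches with at least one click, out of all searches.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```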

Evaluation Cadence

Type	Frequency	Purpose
Automated offline eval	Every code change (CI)	Catch regressions before deployment
Structured offline eval	Weekly/sprint	Track trends; inform roadmap
Human annotation refresh	Quarterly	Keep judgment set current with product changes
A/B test	Per significant change	Confirm offline wins translate to user behavior
Metric review with stakeholders	Monthly	Align on business-level quality definition
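
The automated offline eval row could be implemented as a gate in CI along these lines. The function name, the thresholds, and the per-query score dictionaries are all hypothetical; the scores would come from whatever offline metric the team tracks.

```python
def regression_gate(baseline_scores, candidate_scores,
                    max_mean_drop=0.01, max_query_drop=0.2):
    """Compare per-query scores; return (passed, mean_delta, regressed_queries)."""
    shared = sorted(set(baseline_scores) & set(candidate_scores))
    if not shared:
        raise ValueError("no queries in common between the two runs")
    base_mean = sum(baseline_scores[q] for q in shared) / len(shared)
    cand_mean = sum(candidate_scores[q] for q in shared) / len(shared)

    # Queries whose score dropped by more than the per-query tolerance,
    # worst regression first.
    regressed = sorted(
        (q for q in shared
         if baseline_scores[q] - candidate_scores[q] > max_query_drop),
        key=lambda q: candidate_scores[q] - baseline_scores[q],
    )
    passed = (base_mean - cand_mean) <= max_mean_drop
    return passed, cand_mean - base_mean, regressed

# Invented per-query NDCG scores from two runs.
baseline_scores = {"red shoes": 0.80, "usb c cable": 0.60, "desk lamp": 0.90}
candidate_scores = {"red shoes": 0.85, "usb c cable": 0.65, "desk lamp": 0.50}
passed, mean_delta, regressed = regression_gate(baseline_scores, candidate_scores)
```

A CI job would fail the build when `passed` is false and print `regressed` so the author can see which queries to inspect first.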

Common Failure Modes

Evaluating against stale labels. Judgment sets age. A 2-year-old label set reflects old product, old catalog, old user intent. Build annotation refresh into your process.

Single-metric tunnel vision. Optimizing NDCG alone can hurt diversity, freshness, or zero-results rate. Define a primary metric but track a dashboard of secondaries with guardrails.

Head query bias. Teams manually review their top-50 queries and call it evaluation. The long tail (60-80% of volume) determines real user experience.

Conflating offline and online results. An offline improvement that doesn’t show up in A/B testing doesn’t necessarily mean the offline metric is bad; it may instead reveal a flawed label set. Investigate the gap rather than dismissing one or the other.
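
The "primary metric plus guardrails" idea from the single-metric item above can be expressed as a small decision function. Metric names, directions, and tolerances below are placeholders chosen for illustration; real values depend on the product.

```python
def ship_decision(baseline, candidate, primary, guardrails):
    """Return (ship, violated_guardrails).

    guardrails maps metric name -> (direction, tolerance), where direction is
    "higher" or "lower" (which way is better) and tolerance is the largest
    acceptable move in the wrong direction.
    """
    violations = []
    for metric, (direction, tolerance) in guardrails.items():
        change = candidate[metric] - baseline[metric]
        worsened_by = -change if direction == "higher" else change
        if worsened_by > tolerance:
            violations.append(metric)
    primary_improved = candidate[primary] > baseline[primary]
    return primary_improved and not violations, violations

# Invented metric snapshots for two systems.
baseline = {"ndcg@10": 0.62, "zero_results_rate": 0.040, "category_entropy": 1.90}
candidate = {"ndcg@10": 0.65, "zero_results_rate": 0.055, "category_entropy": 1.88}
guardrails = {
    "zero_results_rate": ("lower", 0.005),
    "category_entropy": ("higher", 0.05),
}
ship, violations = ship_decision(baseline, candidate, "ndcg@10", guardrails)
```

In this example NDCG improves, but the zero-results rate worsens beyond its tolerance, so the change is held back despite the primary-metric win.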

Public Evaluation Datasets

Open datasets provide ready-made judgment lists for offline evaluation. Common in e-commerce:

Dataset	Domain	Scale	Labels
Amazon ESCI Dataset	General e-commerce	~2.6M pairs	4-class ESCI
ESCI-S Dataset	General e-commerce	ESCI + metadata	4-class ESCI
WANDS Dataset	Home goods	~233K pairs (~43K products)	3-class
Home Depot Product Search Relevance	Home improvement	~74K pairs	Continuous 1–3

Use these to: benchmark models before investing in custom annotation, calibrate annotator guidelines, and run cross-domain transfer experiments.

Related

A-B Testing for Search — online evaluation in depth
Judgment Lists — how to build and maintain a label set
Query Sampling — how to build a representative query set
Managing a Search Team — evaluation culture as a team practice
LLM as Judge — scaling annotation with LLMs

LLM Judge Articles

What AI Engineers Should Know about Search — Doug Turnbull; broad primer covering judgment methodology, evaluation biases, pointwise/pairwise/listwise paradigms
Search Quality Assurance with AI as a Judge — Tao Ruangyam; Zalando production pipeline; NER query clustering; multi-language; pre-launch validation
Classic ML to Cope with Dumb LLM Judges — Doug Turnbull; treating per-attribute LLM outputs as ML features for a decision tree
Improving retrieval with LLM-as-a-judge — Jo Kristian Bergum; Vespa retrieval benchmarking with LLM judge
