Search Quality Assurance
The practice of measuring, monitoring, and improving search quality systematically. SQA answers: “Is search getting better or worse, and how do we know?”
Evaluation Paradigms
Session-Based Evaluation
Measures outcomes at the session level — did the user accomplish their task?
- Signals: task completion, session abandonment, zero reformulations
- Strength: captures multi-query behavior; closer to business outcomes
- Weakness: harder to attribute to a specific ranking change; requires more data
Query-Based Evaluation
Measures quality per individual query, averaged over a query set.
- Signals: NDCG, MRR, MAP, P@K on a labeled sample
- Strength: fast, targeted, actionable; easy to diff between systems
- Weakness: ignores session context; quality of eval depends entirely on the query sample
Most offline evaluation is query-based, so the construction of the query sample is critical; see Query Sampling.
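As an illustration of how easily query-based results can be diffed between systems, here is a minimal sketch. It assumes each system has already been scored per query with the same metric (for example NDCG@10); the function name `diff_systems` and the `min_delta` threshold are hypothetical, not part of any library.

```python
from statistics import mean

def diff_systems(baseline, candidate, min_delta=0.0):
    """Compare per-query scores of two systems on their shared queries.

    baseline, candidate: dicts mapping query -> metric score.
    min_delta: changes with absolute size <= min_delta count as ties.
    """
    shared = sorted(set(baseline) & set(candidate))
    deltas = {q: candidate[q] - baseline[q] for q in shared}
    return {
        "mean_delta": mean(deltas.values()) if deltas else 0.0,
        "wins": [q for q in shared if deltas[q] > min_delta],
        "losses": [q for q in shared if deltas[q] < -min_delta],
        "ties": [q for q in shared if abs(deltas[q]) <= min_delta],
    }

report = diff_systems(
    {"red shoes": 0.5, "usb cable": 0.8, "desk lamp": 0.3},
    {"red shoes": 0.7, "usb cable": 0.6, "desk lamp": 0.3},
    min_delta=0.05,
)
print(report["wins"], report["losses"], report["ties"])
```

Listing wins and losses by query, rather than reporting only the mean, makes it easier to see when an unchanged average hides offsetting changes.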
Building the Query Sample
Your evaluation is only as good as your query sample. A biased sample gives misleading metrics.
| Method | When to use |
|---|---|
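| Random sampling | Baseline; unbiased, but the outcome depends on the sampling unit: uniform over distinct queries over-represents rare queries, uniform over log events over-represents head queries |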
| Stratified sampling | Sample separately by query type, frequency tier, or domain — ensures head/torso/tail are all represented |
| PPS (Probability-Proportional-to-Size) | Select each query with probability proportional to its traffic volume; head queries are more likely to be labeled, while tail queries still appear in proportion to their share of traffic |
See Query Sampling for full details.
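The stratified and PPS methods in the table can be sketched as follows. This is a simplified illustration, not a definitive implementation: the input is assumed to be a dict of distinct query to traffic count, the tier thresholds in `tier_of` are made up for the example, and the PPS draw is done with replacement for brevity (a production sampler would more likely draw without replacement).

```python
import random
from collections import defaultdict

def stratified_sample(query_counts, tier_of, per_tier, seed=0):
    """Draw up to per_tier distinct queries from each frequency tier."""
    rng = random.Random(seed)
    tiers = defaultdict(list)
    for query in sorted(query_counts):  # sorted for reproducibility
        tiers[tier_of(query_counts[query])].append(query)
    return {
        tier: rng.sample(queries, min(per_tier, len(queries)))
        for tier, queries in tiers.items()
    }

def pps_sample(query_counts, k, seed=0):
    """Draw k queries with probability proportional to traffic (with replacement)."""
    rng = random.Random(seed)
    queries = sorted(query_counts)
    weights = [query_counts[q] for q in queries]
    return rng.choices(queries, weights=weights, k=k)

# Hypothetical tier boundaries; real ones depend on the traffic distribution.
def tier_of(count):
    if count >= 500:
        return "head"
    if count >= 50:
        return "torso"
    return "tail"

counts = {"shoes": 1000, "red shoes": 100, "size 9 red trail shoes": 1}
print(stratified_sample(counts, tier_of, per_tier=1))
print(pps_sample(counts, k=5))
```

The stratified draw guarantees every tier appears in the sample, whereas the PPS draw mirrors traffic and may miss rare queries entirely in a small sample.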
Metrics
Ranking Quality
Measure whether the right results are ranked at the top.
| Metric | What it captures | Best for |
|---|---|---|
| NDCG | Graded relevance, discounted by position | Multi-grade labels, long result lists |
| MRR | Position of first relevant result | Navigational queries, single-answer tasks |
| MAP | Precision at each relevant result, averaged | Binary relevance, recall-sensitive tasks |
| Precision and Recall | Accuracy of what is returned (precision) versus coverage of what is relevant (recall) | Filtering tasks, document retrieval |
| P@K | Precision in top-K results | Quick sanity check on top of SERP |
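The per-query building blocks behind these metrics can be sketched in a few lines. The sketch makes some assumptions worth stating: relevance labels arrive as a list in ranked order, DCG uses the exponential gain `2**g - 1` (a linear gain is also common), and the ideal ranking for NDCG is built from the retrieved results only unless a full list of judged gains is passed in. MRR and MAP are then the means of `reciprocal_rank` and `average_precision` over the query set.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (binary labels)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def average_precision(rels, total_relevant=None):
    """Mean of precision values taken at each relevant result."""
    hits, score = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            score += hits / rank
    # Without total_relevant, only relevant results that were retrieved count.
    denom = total_relevant if total_relevant is not None else hits
    return score / denom if denom else 0.0

def dcg(gains, k):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, k, judged_gains=None):
    """DCG normalized by the DCG of the best possible ordering."""
    pool = judged_gains if judged_gains is not None else gains
    ideal = dcg(sorted(pool, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0

print(precision_at_k([1, 0, 1, 0], k=2))
print(reciprocal_rank([0, 0, 1]))
print(ndcg([1, 3, 0], k=3))
```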
Diversity Metrics
Measure whether results cover different aspects of the query intent.
| Metric | What it captures |
|---|---|
| MMR (Maximal Marginal Relevance) | Strictly a re-ranking criterion rather than a metric; penalizes redundant results by trading off relevance against novelty |
| APD (Average Pairwise Distance) | Average dissimilarity across all result pairs |
| Entropy | Distribution of result categories/topics |
| Diversity Metrics | Overview of diversity measurement approaches |
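As a rough sketch of two of the measures above, APD and entropy can be computed per result list as follows. The choice of cosine distance over embedding vectors and of a single category label per result are assumptions made for the example; other distance functions or topic assignments would work the same way.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def average_pairwise_distance(vectors):
    """Mean distance over all unordered pairs of result vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

def category_entropy(categories):
    """Shannon entropy (bits) of the category distribution in a result list."""
    n = len(categories)
    counts = Counter(categories)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(average_pairwise_distance([[1, 0], [0, 1], [1, 1]]))
print(category_entropy(["shoes", "shoes", "socks", "laces"]))
```

An entropy of 0 means every result shares one category; the maximum for a list with `m` distinct categories is `log2(m)`.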
Behavioral / Product Metrics
Derived from real user interactions — no labels needed, but noisier.
| Metric | What it captures |
|---|---|
| CTR | Click-through rate; engagement with results |
| Zero clicks | Queries where no result was clicked — possible failure or direct answer |
| Clicks Residual | Gap between expected and observed click distribution — success signal at query level |
| Zero Results | Queries returning no results — direct coverage failure |
| Query reformulation rate | Users who rephrase after seeing results — dissatisfaction signal |
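Several of these rates can be derived from a simple search log, as in the sketch below. The `SearchEvent` record and its fields are hypothetical, and the definitions are one reasonable choice among several: CTR is taken as the share of searches with at least one click, and the zero-click rate is computed only over searches that returned results so that it is not confounded with zero-results queries. Clicks Residual is omitted because it needs an expected-click model.

```python
from dataclasses import dataclass

@dataclass
class SearchEvent:
    query: str
    num_results: int
    num_clicks: int
    reformulated: bool  # True if the user rephrased the query afterwards

def behavioral_summary(events):
    """Aggregate query-level behavioral rates from a list of SearchEvent."""
    n = len(events)
    if n == 0:
        return {}
    with_results = [e for e in events if e.num_results > 0]
    zero_click = sum(1 for e in with_results if e.num_clicks == 0)
    return {
        "ctr": sum(1 for e in events if e.num_clicks > 0) / n,
        "zero_results_rate": (n - len(with_results)) / n,
        "zero_click_rate": zero_click / len(with_results) if with_results else 0.0,
        "reformulation_rate": sum(1 for e in events if e.reformulated) / n,
    }

log = [
    SearchEvent("red shoes", 10, 1, False),
    SearchEvent("rde shoes", 0, 0, True),
    SearchEvent("shoes", 5, 0, True),
    SearchEvent("trail shoes", 5, 2, False),
]
print(behavioral_summary(log))
```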
Evaluation Modes
Offline Evaluation
Evaluate a system against a labeled dataset without user traffic.
Advantages: fast, cheap, reproducible, safe (no user exposure to bad systems)
Limitations: labels may not reflect current user intent; doesn’t capture behavioral effects
Requires Judgment Lists. Three ways to produce labels:
- Human judgments — annotators rate query-document pairs; gold standard but slow and expensive
- Implicit judgments — derive labels from click logs, dwell time, purchases; fast but biased by position and existing ranking
- LLM as judge — use an LLM to produce relevance ratings; scalable but requires calibration against human labels (see LLM as Judge)
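One common way to calibrate implicit or LLM-produced labels against human labels is to measure chance-corrected agreement on a shared set of query-document pairs. The sketch below computes unweighted Cohen's kappa; this is a swapped-in illustration of the calibration step, not a method the source prescribes, and the label values are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

human = [2, 1, 0, 2, 1, 0, 2, 0]
llm = [2, 1, 0, 1, 1, 0, 2, 1]
print(cohens_kappa(human, llm))
```

Unweighted kappa treats every disagreement as equally severe. For graded relevance labels a weighted variant may be more appropriate, for example scikit-learn's `cohen_kappa_score` with its `weights` parameter.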
Online Evaluation
Measure the impact of a change on real users via controlled experiments.
See A-B Testing for Search for the full treatment: interleaving, A/B splits, metric selection, statistical considerations.
Evaluation Cadence
| Type | Frequency | Purpose |
|---|---|---|
| Automated offline eval | Every code change (CI) | Catch regressions before deployment |
| Structured offline eval | Weekly/sprint | Track trends; inform roadmap |
| Human annotation refresh | Quarterly | Keep judgment set current with product changes |
| A/B test | Per significant change | Confirm offline wins translate to user behavior |
| Metric review with stakeholders | Monthly | Align on business-level quality definition |
Common Failure Modes
Evaluating against stale labels. Judgment sets age. A 2-year-old label set reflects an old product, an old catalog, and old user intent. Build annotation refresh into your process.
Single-metric tunnel vision. Optimizing NDCG alone can hurt diversity, freshness, or zero-results rate. Define a primary metric but track a dashboard of secondaries with guardrails.
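A primary metric with guardrails can be expressed as a small release check. The sketch below is illustrative only: the metric names, directions, and tolerances are hypothetical, and a real gate would also account for statistical noise in each metric.

```python
def check_release(baseline, candidate, primary, guardrails):
    """Return the metrics that block a release.

    baseline, candidate: dicts mapping metric name -> value.
    primary: metric that must not decrease.
    guardrails: dict mapping metric name -> (better, tolerance), where
        better is "higher" or "lower" and tolerance is the largest
        acceptable move in the wrong direction.
    """
    failures = []
    if candidate[primary] < baseline[primary]:
        failures.append(primary)
    for metric, (better, tolerance) in guardrails.items():
        delta = candidate[metric] - baseline[metric]
        regression = -delta if better == "higher" else delta
        if regression > tolerance:
            failures.append(metric)
    return failures

baseline = {"ndcg@10": 0.60, "apd": 0.40, "zero_results_rate": 0.05}
candidate = {"ndcg@10": 0.63, "apd": 0.30, "zero_results_rate": 0.05}
guardrails = {"apd": ("higher", 0.02), "zero_results_rate": ("lower", 0.01)}
print(check_release(baseline, candidate, "ndcg@10", guardrails))
```

In this example the candidate improves NDCG but is flagged because diversity (APD) dropped by more than its tolerance.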
Head query bias. Teams manually review their top-50 queries and call it evaluation. The long tail (60-80% of volume) determines real user experience.
Conflating offline and online results. An offline improvement that doesn’t show up in A/B testing doesn’t necessarily mean the metric is bad; it may instead reveal a flawed label set. Investigate the gap rather than dismissing one result or the other.
Public Evaluation Datasets
Open datasets provide ready-made judgment lists for offline evaluation. Commonly used in e-commerce:
| Dataset | Domain | Scale | Labels |
|---|---|---|---|
| Amazon ESCI Dataset | General e-commerce | Very large | 4-class ESCI |
| ESCI-S Dataset | General e-commerce | ESCI + metadata | 4-class ESCI |
| WANDS Dataset | Home goods | ~42K pairs | 3-class |
| Home Depot Product Search Relevance | Home improvement | ~74K pairs | Continuous 1–3 |
Use these to: benchmark models before investing in custom annotation, calibrate annotator guidelines, and run cross-domain transfer experiments.
Related
- A-B Testing for Search — online evaluation in depth
- Judgment Lists — how to build and maintain a label set
- Query Sampling — how to build a representative query set
- Managing a Search Team — evaluation culture as a team practice
- LLM as Judge — scaling annotation with LLMs
LLM Judge Articles
- What AI Engineers Should Know about Search — Doug Turnbull; broad primer covering judgment methodology, evaluation biases, pointwise/pairwise/listwise paradigms
- Search Quality Assurance with AI as a Judge — Tao Ruangyam; Zalando production pipeline; NER query clustering; multi-language; pre-launch validation
- Classic ML to Cope with Dumb LLM Judges — Doug Turnbull; treating per-attribute LLM outputs as ML features for a decision tree
- Improving retrieval with LLM-as-a-judge — Jo Kristian Bergum; Vespa retrieval benchmarking with LLM judge