Search Quality Assurance

Search Quality Assurance (SQA) is the practice of systematically measuring, monitoring, and improving search quality. It answers: “Is search getting better or worse, and how do we know?”


Evaluation Paradigms

Session-Based Evaluation

Measures outcomes at the session level — did the user accomplish their task?

  • Signals: task completion, session abandonment, zero reformulations
  • Strength: captures multi-query behavior; closer to business outcomes
  • Weakness: harder to attribute to a specific ranking change; requires more data

See Session-Based Evaluation.
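As a rough illustration of the session-level signals listed above, the sketch below groups per-query events into sessions and computes two rates. The `QueryEvent` schema and the exact definitions of "abandoned" and "zero reformulations" are assumptions made for this example, not definitions taken from the linked page.

```python
from dataclasses import dataclass


@dataclass
class QueryEvent:
    # Hypothetical log schema: one row per query issued in a session.
    session_id: str
    query: str
    clicked: bool  # True if any result was clicked for this query


def session_signals(events):
    """Aggregate per-query events into session-level rates."""
    sessions = {}
    for e in events:
        sessions.setdefault(e.session_id, []).append(e)

    n = len(sessions)
    # Assumed definition: a session is abandoned if none of its queries got a click.
    abandoned = sum(
        1 for evs in sessions.values() if not any(e.clicked for e in evs)
    )
    # Assumed definition: zero reformulations means one distinct query string.
    single_query = sum(
        1 for evs in sessions.values() if len({e.query for e in evs}) == 1
    )
    return {
        "sessions": n,
        "abandonment_rate": abandoned / n if n else 0.0,
        "zero_reformulation_rate": single_query / n if n else 0.0,
    }


events = [
    QueryEvent("s1", "red shoes", True),
    QueryEvent("s2", "laptop", False),
    QueryEvent("s2", "laptop 16gb", False),
    QueryEvent("s3", "usb c cable", False),
    QueryEvent("s3", "usb-c cable 2m", True),
]
signals = session_signals(events)
```

Task completion usually needs a product-specific definition (purchase, long dwell, saved item), so it is left out of this sketch.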

Query-Based Evaluation

Measures quality per individual query, averaged over a query set.

  • Signals: NDCG, MRR, MAP, P@K on a labeled sample
  • Strength: fast, targeted, actionable; easy to diff between systems
  • Weakness: ignores session context; quality of eval depends entirely on the query sample

Most offline evaluation is query-based. The query sample construction is critical — see Query Sampling.


Building the Query Sample

The quality of your evaluation is only as good as your query sample. A biased sample gives misleading metrics.

Method | When to use
Random sampling | Baseline; simple, but uniform sampling over distinct queries over-represents rare queries relative to their traffic
Stratified sampling | Sample separately by query type, frequency tier, or domain; ensures head/torso/tail are all represented
PPS (Probability-Proportional-to-Size) | Weight sample by traffic volume; head queries get more labels, but tail queries still appear in proportion to their traffic

See Query Sampling for full details.
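The stratified and PPS strategies in the table above can be sketched in a few lines of standard-library Python. The tier thresholds in `frequency_tier` and the example counts are hypothetical, and the PPS draw here is with replacement, which is only one of several ways to implement it.

```python
import random
from collections import defaultdict


def pps_sample(query_counts, k, seed=0):
    """Draw k queries with probability proportional to traffic (with replacement)."""
    rng = random.Random(seed)
    queries = list(query_counts)
    weights = [query_counts[q] for q in queries]
    return rng.choices(queries, weights=weights, k=k)


def stratified_sample(query_counts, per_stratum, tier_of, seed=0):
    """Sample up to per_stratum distinct queries from each frequency tier."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q, count in query_counts.items():
        strata[tier_of(count)].append(q)
    return {
        tier: rng.sample(qs, min(per_stratum, len(qs)))
        for tier, qs in strata.items()
    }


def frequency_tier(count):
    # Hypothetical thresholds; tune them to your own traffic distribution.
    if count >= 1000:
        return "head"
    if count >= 50:
        return "torso"
    return "tail"


query_counts = {
    "iphone": 5000,
    "nike shoes": 2400,
    "usb c hub": 300,
    "desk lamp": 120,
    "left handed scissors": 7,
    "blue ceramic teapot 1l": 3,
}
by_tier = stratified_sample(query_counts, per_stratum=1, tier_of=frequency_tier)
by_traffic = pps_sample(query_counts, k=5)
```

Because the PPS draw is with replacement, a head query can appear several times; deduplicate before sending queries to annotators if each query should be labeled once.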


Metrics

Ranking Quality

Measure whether the right results are ranked at the top.

Metric | What it captures | Best for
NDCG | Graded relevance, discounted by position | Multi-grade labels, long result lists
MRR | Position of first relevant result | Navigational queries, single-answer tasks
MAP | Precision at each relevant result, averaged | Binary relevance, recall-sensitive tasks
Precision and Recall | Coverage and accuracy tradeoff | Filtering tasks, document retrieval
P@K | Precision in top-K results | Quick sanity check on the top of the SERP
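For reference, here is a minimal per-query sketch of the ranking metrics in the table above. It makes a few simplifying assumptions: NDCG uses linear gain (not the exponential `2^rel - 1` variant) and takes the ideal ordering from the retrieved list only, and average precision is normalized by the number of relevant results that were retrieved rather than all relevant documents in the corpus.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded labels (linear gain)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """gains: graded relevance labels in ranked order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(rels):
    """rels: binary relevance in ranked order; 0.0 if nothing relevant."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0


def precision_at_k(rels, k):
    return sum(1 for r in rels[:k] if r) / k


def average_precision(rels):
    """Mean of precision values at each relevant position in the ranking."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Two example queries with binary labels in ranked order.
runs = {"q1": [0, 1, 0, 1], "q2": [1, 0, 0, 0]}
mrr = mean(reciprocal_rank(r) for r in runs.values())
map_score = mean(average_precision(r) for r in runs.values())
```

MRR and MAP are the per-query values averaged over the query set, which is why the composition of that set matters so much.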

Diversity Metrics

Measure whether results cover different aspects of the query intent.

Metric | What it captures
MMR (Maximal Marginal Relevance) | Penalizes redundant results; balances relevance and novelty
APD (Average Pairwise Distance) | Average dissimilarity across all result pairs
Entropy | Distribution of result categories/topics

See Diversity Metrics for an overview of diversity measurement approaches.
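Two of the diversity measures above are simple enough to sketch directly. The example assumes each result carries a category label (for entropy) and an embedding vector (for APD), and it uses cosine distance as the dissimilarity function; any other distance would work the same way.

```python
import math
from collections import Counter
from itertools import combinations


def category_entropy(categories):
    """Shannon entropy (bits) of the category distribution in a result list."""
    n = len(categories)
    if n == 0:
        return 0.0
    counts = Counter(categories)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(vectors):
    """Mean cosine distance over all unordered pairs of result vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


entropy_mixed = category_entropy(["shoes", "shoes", "bags", "bags"])
apd_orthogonal = average_pairwise_distance([[1.0, 0.0], [0.0, 1.0]])
```

A result list drawn from a single category has entropy 0, and a list of identical vectors has APD 0; both values grow as the list covers more distinct intents.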

Behavioral / Product Metrics

Derived from real user interactions — no labels needed, but noisier.

Metric | What it captures
CTR | Click-through rate; engagement with results
Zero clicks | Queries where no result was clicked; possible failure or direct answer
Clicks Residual | Gap between expected and observed click distribution; success signal at query level
Zero Results | Queries returning no results; direct coverage failure
Query reformulation rate | Users who rephrase after seeing results; dissatisfaction signal
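A minimal sketch of how three of these rates might be computed from a search log. The row schema (`num_results`, `clicks`) is hypothetical, and CTR is taken here as the share of searches with results that received at least one click, which is only one of several common definitions.

```python
def behavioral_metrics(log):
    """log: list of dicts with 'query', 'num_results', 'clicks' (assumed schema)."""
    n = len(log)
    if n == 0:
        return {}
    with_results = [row for row in log if row["num_results"] > 0]
    clicked = sum(1 for row in with_results if row["clicks"] > 0)
    shown = len(with_results)
    return {
        # Share of all searches that returned nothing.
        "zero_results_rate": (n - shown) / n,
        # Among searches that showed results: at least one click vs. none.
        "ctr": clicked / shown if shown else 0.0,
        "zero_click_rate": (shown - clicked) / shown if shown else 0.0,
    }


log = [
    {"query": "red shoes", "num_results": 24, "clicks": 2},
    {"query": "weather berlin", "num_results": 10, "clicks": 0},
    {"query": "asdfgh", "num_results": 0, "clicks": 0},
    {"query": "usb c hub", "num_results": 18, "clicks": 1},
]
metrics = behavioral_metrics(log)
```

Zero-result searches are kept out of the CTR denominator so that coverage failures and engagement failures are reported separately.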

Evaluation Modes

Offline Evaluation

Evaluate a system against a labeled dataset without user traffic.

Advantages: fast, cheap, reproducible, safe (no user exposure to bad systems)

Limitations: labels may not reflect current user intent; doesn’t capture behavioral effects

Requires Judgment Lists. Three ways to produce labels:

  • Human judgments — annotators rate query-document pairs; gold standard but slow and expensive
  • Implicit judgments — derive labels from click logs, dwell time, purchases; fast but biased by position and existing ranking
  • LLM as judge — use an LLM to produce relevance ratings; scalable but requires calibration against human labels (see LLM as Judge)
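One common way to check an LLM judge against human labels is chance-corrected agreement on a shared set of query-document pairs. The sketch below computes Cohen's kappa by hand; the label values are made up for illustration, and kappa is only one possible calibration measure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical graded labels (0 = irrelevant, 1 = partial, 2 = relevant).
human = [2, 1, 0, 2, 1, 0]
llm = [2, 1, 0, 2, 0, 0]
kappa = cohens_kappa(human, llm)
```

For ordinal grades, a weighted kappa or a rank correlation may be more informative, since plain kappa treats a 2-vs-0 disagreement the same as a 2-vs-1 disagreement.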

Online Evaluation

Measure the impact of a change on real users via controlled experiments.

See A-B Testing for Search for the full treatment: interleaving, A/B splits, metric selection, statistical considerations.
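As a small taste of the statistical side, the sketch below runs a two-sided two-proportion z-test on a click-through comparison between control and treatment. The counts are invented, and a real experiment would also need to handle sample-size planning, multiple metrics, and per-user rather than per-search randomization, all of which this example ignores.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in proportions B - A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, nothing to test
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: searches with at least one click out of all searches.
z, p_value = two_proportion_z_test(300, 1000, 400, 1000)
```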


Evaluation Cadence

Type | Frequency | Purpose
Automated offline eval | Every code change (CI) | Catch regressions before deployment
Structured offline eval | Weekly or per sprint | Track trends; inform roadmap
Human annotation refresh | Quarterly | Keep judgment set current with product changes
A/B test | Per significant change | Confirm offline wins translate to user behavior
Metric review with stakeholders | Monthly | Align on business-level quality definition

Common Failure Modes

Evaluating against stale labels. Judgment sets age. A 2-year-old label set reflects old product, old catalog, old user intent. Build annotation refresh into your process.

Single-metric tunnel vision. Optimizing NDCG alone can hurt diversity, freshness, or zero-results rate. Define a primary metric but track a dashboard of secondaries with guardrails.
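The "primary metric plus guardrails" idea can be expressed as a simple check that compares a candidate system to a baseline. The metric names, directions, and tolerances below are hypothetical placeholders; real thresholds should come from the noise level observed in your own evaluations.

```python
def check_guardrails(baseline, candidate, primary, guardrails):
    """guardrails maps metric -> (direction, tolerance).

    direction is 'higher' or 'lower', meaning which way is better.
    """
    failures = []
    for metric, (direction, tolerance) in guardrails.items():
        delta = candidate[metric] - baseline[metric]
        if direction == "higher":
            regressed = delta < -tolerance
        else:
            regressed = delta > tolerance
        if regressed:
            failures.append(metric)
    improved = candidate[primary] > baseline[primary]
    return {
        "ship": improved and not failures,
        "primary_improved": improved,
        "failed_guardrails": failures,
    }


baseline = {"ndcg@10": 0.62, "entropy": 1.8, "zero_results_rate": 0.040}
candidate = {"ndcg@10": 0.65, "entropy": 1.4, "zero_results_rate": 0.041}
guardrails = {
    "entropy": ("higher", 0.1),            # diversity should not drop much
    "zero_results_rate": ("lower", 0.005),  # coverage should not get worse
}
report = check_guardrails(baseline, candidate, "ndcg@10", guardrails)
```

In this example NDCG improves, but the drop in entropy exceeds its tolerance, so the change is flagged rather than shipped.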

Head query bias. Teams manually review their top 50 queries and call it evaluation, but the long tail (60–80% of volume) shapes most of the real user experience.

Conflating offline and online results. When an offline improvement doesn’t show up in A/B testing, the offline metric is not necessarily at fault; the gap may instead reveal a flawed label set. Investigate the gap rather than dismissing one result or the other.


Public Evaluation Datasets

Open datasets provide ready-made judgment lists for offline evaluation. Common in e-commerce:

Dataset | Domain | Scale | Labels
Amazon ESCI Dataset | General e-commerce | Very large | 4-class ESCI
ESCI-S Dataset | General e-commerce | ESCI + metadata | 4-class ESCI
WANDS Dataset | Home goods | ~233K pairs (~43K products) | 3-class
Home Depot Product Search Relevance | Home improvement | ~74K pairs | Continuous 1–3

Use these to: benchmark models before investing in custom annotation, calibrate annotator guidelines, and run cross-domain transfer experiments.


LLM Judge Articles